Please complete all core assignment tasks. Optional tasks do not carry any points but are highly recommended.
The goals of this HW include the following:
Please consult Canvas for the grading rubric (check Module 05 -> HW 05).
Use the starter code below to calculate the first iteration of gradient descent to minimize the objective function $2x^2+ 4x$. We assume that the initial guess is $x=2$ of and a stepsize of 0.1
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to :
def step(x, eta):
update = eta * (4*x+4) #Insert f prime here
return x - update
step(x=2, eta=0.1)
#==================================================#
# Your code ends here #
#==================================================#
0.7999999999999998
Assume you are learning a linear regression model (using the mean squared error objective loss function) via stochastic gradient descent with the following setting:
import numpy as np
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to :
X= [1,2,3]
W = [1, 0, 1]
y=[6]
alpha= 0.1
dotp=np.dot(W,X)
gradient= (dotp-y)*X
newW= W - alpha*gradient
print(",".join([str(i) for i in np.round(newW,3)]))
#==================================================#
# Your code ends here #
#==================================================#
1.2,0.4,1.6
Given a linear regression model with parameters θ where $θ_0$ corresponds to the bias term:
import numpy as np
#==================================================#
# Your code starts here #
#==================================================#
# Feel free to adopt the following code base
#TODO - change above to :
X=np.array([1, 5, 1, 2])
w=np.array([2,-1, 4, 8])
eta = .01
y = 6
w=w - eta * (np.dot(w,X)-y) * (X)
print(",".join([str(i) for i in np.round(w,3)]))
#==================================================#
# Your code ends here #
#==================================================#
1.89,-1.55,3.89,7.78
We have seen this dataset previously while working with KNN Regression. In this notebook, we're going to build a different regression model for predicting house prices in thousands of dollars given factors such as crime rate in neighborhood, number of schools, % lower status of the population, etc.
Import required libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.3)
matplotlib.rcParams["legend.framealpha"] = 1
matplotlib.rcParams["legend.frameon"] = True
Fix random seed for reproducibility
np.random.seed(42)
Boston dataset is extremely common in machine learning experiments thus it is embedded in sklearn.
boston = load_boston()
Detailed description of dataset and features
print(boston.DESCR)
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Create pandas dataframe with objects in rows and features in columns
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
X.head()
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
All features are numerical, but note that some features are categorical (e.g., CHAS) while others are continuous.
X.describe()
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 |
We will start by creating a scatterplot matrix that will allow us to visualize the pair-wise relationships and correlations between the different features. It is also quite useful to have a quick overview of how the data is distributed and wheter it cointains or not outliers.
DataFrame.copy(self, deep=True)[source]
Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).
#make deep copy of the X dataframe so we can append the target variable MEDV
X_y= X.copy()
X_y["MEDV"] = y
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Calculate and show pairplot
# NOTE: This might take a minute or two
sns.pairplot(X_y, size=2.5)
plt.tight_layout()
/usr/local/lib/python3.7/site-packages/seaborn/axisgrid.py:2076: UserWarning: The `size` parameter has been renamed to `height`; please update your code. warnings.warn(msg, UserWarning)
We can spot a linear relationship between ‘RM’ and House prices ‘MEDV’. In addition, we can infer from the histogram that the ‘MEDV’ variable seems to be normally distributed but contain several outliers.
Let's also take a look into correlation matrix of features
# compute the correlation matrix
corr = X.corr()
# generate a mask for the lower triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
square=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax);
Let's split data into a train and test subsets so we can avoid ROTE learning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
There are lots of feature. Let's visualize two of them across the train and test data.
plt.figure(figsize=(18, 8))
plt.subplot(121)
plt.scatter(X_train.RM, y_train, label="Train")
plt.scatter(X_test.RM, y_test, c="r", label="Test")
plt.xlabel("Average number of rooms per dwelling")
plt.ylabel("Price, $")
plt.legend(loc="lower right", frameon=True)
plt.subplot(122)
plt.scatter(X_train.RAD, y_train, label="Train")
plt.scatter(X_test.RAD, y_test, c="r", label="Test")
plt.xlabel("Index of accessibility to radial highways")
plt.ylabel("Price, $")
plt.legend(loc="lower right");
Normalize data in the range $(0,1)$ to make our data insensitive to the scale of features.
scaler = MinMaxScaler()
Note that we're going to learn normalization constants only on training set. That's done because the assumption is that test set is unreachable during training.
X_train = scaler.fit_transform(X_train)
Transform test set with the same constants
X_test.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 102 entries, 173 to 75 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CRIM 102 non-null float64 1 ZN 102 non-null float64 2 INDUS 102 non-null float64 3 CHAS 102 non-null float64 4 NOX 102 non-null float64 5 RM 102 non-null float64 6 AGE 102 non-null float64 7 DIS 102 non-null float64 8 RAD 102 non-null float64 9 TAX 102 non-null float64 10 PTRATIO 102 non-null float64 11 B 102 non-null float64 12 LSTAT 102 non-null float64 dtypes: float64(13) memory usage: 11.2 KB
X_test = scaler.transform(X_test)
type(X)
pandas.core.frame.DataFrame
#y.hist()
X.describe()
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 |
# the histogram of the data
n, bins, patches = plt.hist(y, 50, density=True, facecolor='green', alpha=0.75)
# add a 'best fit' line
#l = plt.plot(bins, y, 'r--', linewidth=1)
plt.xlabel('House prices')
plt.ylabel('Probability')
plt.title(r'Histogram of Median value of owner-occupied homes in $1,000s')
#plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()
%matplotlib inline
import matplotlib.pyplot as plt
X.hist(bins=50, figsize=(20,15))
plt.show()
Here we use very simple Linear Regression model. Scikit-learn uses the closed-form solition for Linear Regression problem thus it gives very good results.
model_sk = LinearRegression()
Fitting model on prepared data
model_sk.fit(X_train, y_train)
LinearRegression()
Let's see what features are significant for the model. Largest coefficients will have greatest impact on the model.
plt.figure(figsize=(12, 8))
plt.bar(np.arange(model_sk.coef_.shape[0]) - 0.4, model_sk.coef_)
plt.xticks(np.arange(model_sk.coef_.shape[0]), X.columns, rotation='vertical')
plt.xlim([-1, model_sk.coef_.shape[0]])
plt.title("Sklearn model coefficients");
Predicting both train and test sets to evaluate model
preds_test = model_sk.predict(X_test)
preds_train = model_sk.predict(X_train)
There is no MAPE implementation in sklearn (because this metric is undefined when real value is zero). Below one can find my own implementation.
The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of prediction accuracy of a forecasting method in statistics, for example in trend estimation. It usually expresses accuracy as a percentage, and is defined by the formula:
$${\displaystyle {\mbox{M}}={\frac {100\%}{n}}\sum _{t=1}^{n}\left|{\frac {A_{t}-F_{t}}{A_{t}}}\right|,} $$where At is the actual value and Ft is the forecast value.
The difference between At and Ft is divided by the Actual value At again. The absolute value in this calculation is summed for every forecasted point in time and divided by the number of fitted points n. Multiplying by 100% makes it a percentage error.
Although the concept of MAPE sounds very simple and convincing, it has major drawbacks in practical application
It cannot be used if there are zero values (which sometimes happens for example in demand data) because there would be a division by zero. For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error. When MAPE is used to compare the accuracy of prediction methods it is biased in that it will systematically select a method whose forecasts are too low. This little-known but serious issue can be overcome by using an accuracy measure based on the ratio of the predicted to actual value (called the Accuracy Ratio), this approach leads to superior statistical properties and leads to predictions which can be interpreted in terms of the geometric mean.
For more details on MAPE please see here
def mean_absolute_percentage_error(y_true, y_pred):
"""
Use of this metric is not recommended because can cause division by zero
See other regression metrics on sklearn docs:
http://scikit-learn.org/stable/modules/classes.html#regression-metrics
Use like any other metric
>>> y_true = [3, -0.5, 2, 7]; y_pred = [2.5, -0.3, 2, 8]
>>> mean_absolute_percentage_error(y_true, y_pred)
Out[]: 24.791666666666668
"""
return np.mean(np.abs((y_true.ravel() - y_pred.ravel()) / y_true.ravel())) * 100
Let's evaluate our model according to three different metrics:
metrics = [mean_absolute_error,
lambda y_true, y_pred: mean_squared_error(y_true, y_pred) ** 0.5,
mean_absolute_percentage_error]
metrics_names = ["MAE",
"RMSE",
"MAPE"]
Also we want to check quality on both train and test sets
samples = [(X_train, y_train),
(X_test, y_test)]
models_names = ["Sklearn"]
Let's do it in loop
models_names = ["Sklearn"]
def evaluate(models, metrics, samples, metrics_names, models_names):
results = np.zeros((len(samples) * len(models), len(metrics)))
samples_names = []
for m in models_names:
samples_names.extend([m + " Train", m + " Test"])
for m_num, model in enumerate(models):
for row, sample in enumerate(samples):
for col, metric in enumerate(metrics):
results[row + m_num * 2, col] = np.round(metric(sample[1], model.predict(sample[0])), decimals=3)
results = pd.DataFrame(results, columns=metrics_names, index=samples_names)
return results
models = [model_sk]
Evaluated metrics:
evaluate(models, metrics, samples, metrics_names, models_names)
| MAE | RMSE | MAPE | |
|---|---|---|---|
| Sklearn Train | 3.315 | 4.652 | 16.575 |
| Sklearn Test | 3.189 | 4.929 | 16.866 |
It also interesting to take a look how the predicted points relate to real ones. All the points should lie on the black dotted line ($y=x$) assuming that our model is perfect
plt.figure(figsize=(10, 8))
plt.scatter(y_train, preds_train, label="Train")
plt.scatter(y_test, preds_test, c="r", label="Test")
plt.xlabel("Real price, $")
plt.ylabel("Predicted price, $")
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=3, label="Ideal")
plt.legend(loc="lower right");
The common method to evaluate the model is cross-validation. The idea behind it is to divide the whole set of objects into $k$ sections and then use one section as a test set and other $k-1$ as a train (repeat it with all the sections).
There is a special function for this in sklearn called $\text{KFold}$. It creates set of indices for cross-validation.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
Next step is to do everything that we've done before in a loop:
And store the average value of the errors ($\text{res}$ variable)
cv_idx = list(cv.split(X_train, y_train))
res = None
for train_idx, test_idx in cv_idx:
# split
X_train, X_test = X.values[train_idx], X.values[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
# scale
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
samples_cv = [(X_train, y_train),
(X_test, y_test)]
# fit
model_sk_cv = LinearRegression().fit(samples_cv[0][0], samples_cv[0][1])
# evaluate
if res is None:
res = evaluate([model_sk_cv], metrics, samples_cv, metrics_names, ["Sklearn CV"])
else:
res += evaluate([model_sk_cv], metrics, samples_cv, metrics_names, ["Sklearn CV"])
# take the average value across all folds
# show total residuals before taking average of 5 splits
res /= cv.n_splits
Here is the result of CV
res
| MAE | RMSE | MAPE | |
|---|---|---|---|
| Sklearn CV Train | 3.2916 | 4.7324 | 15.4034 |
| Sklearn CV Test | 3.4710 | 5.0788 | 16.1954 |
Using the Boston dataset from above (train, and blind test split).
Core Assignment: Cross-validation with k=10 and stdev.
Repeat the above experiment but change the code so that you can compute the standard deviation of the various metrics of interest (i.e., report MAE and stdMAE, RMSE and stdRMSE etc.) via cross-validation with k=10.
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
def evaluate_std(models, metrics, samples, metrics_names, models_names):
scores = np.zeros((len(samples) * len(models), len(metrics)))
for m_num, model in enumerate(models):
for row, sample in enumerate(samples):
for col, metric in enumerate(metrics):
scores[row + m_num * 2, col] = metric(sample[1], model.predict(sample[0]))
return scores
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=42)
cv_idx = list(cv.split(X_train, y_train))
res = []
for train_idx, test_idx in cv_idx:
# split
X_train, X_test = X.values[train_idx], X.values[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
# build pipeline using standard scaler and linear regression
# with default settings for both
#==================================================#
# Your code starts here #
#==================================================#
pipe = Pipeline([('scalar',StandardScaler()),('linreg',LinearRegression())])
#==================================================#
# Your code ends here #
#==================================================#
samples_cv = [(X_train, y_train),
(X_test, y_test)]
# fit
model_sk_cv = pipe.fit(samples_cv[0][0], samples_cv[0][1])
# evaluate
res.append(evaluate_std([model_sk_cv], metrics, samples_cv, metrics_names, ["Sklearn CV"]))
models_names = ["Sklearn"]
samples_names = []
for m in models_names:
samples_names.extend([m + " Train", m + " Test"])
# extract results from KFold tests. e.g.,
# results[0] contains five MAE train scores
# results[3] contains five MAE test scores
results=[]
for s in range(len(samples_names)):
for m in range(len(metrics_names)):
calc = []
for r in range(len(res)):
calc.append(res[r][s][m])
results.append(calc)
#build results table
metrics_names_std = []
for m in metrics_names:
metrics_names_std.extend([m, m+"_STD"])
table = pd.DataFrame(index=samples_names, columns=metrics_names_std)
#compute stats and populate table
count=0
for s in range(len(samples_names)):
for c in range(len(metrics_names)):
table.iloc[s,c*2] = np.round(np.mean(results[count]), decimals=3)
# calculate std deviation of the results
#==================================================#
# Your code starts here #
#==================================================#
table.iloc[s,(c*2)+1] = np.round(np.std(results[count]), decimals=3)
#==================================================#
# Your code ends here #
# Please don't add code below here #
#==================================================#
count+=1
table
| MAE | MAE_STD | RMSE | RMSE_STD | MAPE | MAPE_STD | |
|---|---|---|---|---|---|---|
| Sklearn Train | 3.303 | 0.075 | 4.757 | 0.106 | 15.413 | 0.345 |
| Sklearn Test | 3.463 | 0.43 | 4.974 | 0.942 | 16.116 | 2.334 |
The question provides a lot of background and sample code that you should review, run and extend to get familar with this problem and data set. Then you should tackle the tasks posed at the end of this question.
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world [as of 2014].
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
Overview on the project:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
train=pd.read_csv('Kaggle-Bike-Sharing-Demand-Challenge/train.csv')
test=pd.read_csv('Kaggle-Bike-Sharing-Demand-Challenge/test.csv')
Understanding the Data Set The dataset shows hourly rental data for two years (2011 and 2012). The training data set is for the first 19 days of each month. The test dataset is from 20th day to month’s end. We are required to predict the total count of bikes rented during each hour covered by the test set.
In the training data set, they have separately given bike demand by registered, casual users and sum of both is given as count.
Training data set has 12 variables (see below) and Test has 9 (excluding registered, casual and count).
Independent Variables
datetime: date and hour in "mm/dd/yyyy hh:mm" format
season: Four categories-> 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday: whether the day is a holiday or not (1/0)
workingday: whether the day is neither a weekend nor holiday (1/0)
weather: Four Categories of weather
1-> Clear, Few clouds, Partly cloudy, Partly cloudy
2-> Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3-> Light Snow and Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4-> Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp: hourly temperature in Celsius
atemp: "feels like" temperature in Celsius
humidity: relative humidity
windspeed: wind speed
Dependent Variables
registered: number of registered user
casual: number of non-registered user
count: number of total rentals (registered + casual)
Notice the Training dataset has 3 target features:
registered: number of registered user
casual: number of non-registered user
count: number of total rentals (registered + casual)
Here we will focus on predicting count feature initially.
print("TRAIN", train.shape)
print("TEST", test.shape)
TRAIN (10886, 12) TEST (6493, 9)
train.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |
test.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 3 | 2011-01-20 03:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
| 4 | 2011-01-20 04:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
train.isnull().sum()
datetime 0 season 0 holiday 0 workingday 0 weather 0 temp 0 atemp 0 humidity 0 windspeed 0 casual 0 registered 0 count 0 dtype: int64
train.describe() #summary statistics for each variable
| season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.00000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 |
| mean | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.886460 | 12.799395 | 36.021955 | 155.552177 | 191.574132 |
| std | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 49.960477 | 151.039033 | 181.144454 |
| min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.82000 | 0.760000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 13.94000 | 16.665000 | 47.000000 | 7.001500 | 4.000000 | 36.000000 | 42.000000 |
| 50% | 3.000000 | 0.000000 | 1.000000 | 1.000000 | 20.50000 | 24.240000 | 62.000000 | 12.998000 | 17.000000 | 118.000000 | 145.000000 |
| 75% | 4.000000 | 0.000000 | 1.000000 | 2.000000 | 26.24000 | 31.060000 | 77.000000 | 16.997900 | 49.000000 | 222.000000 | 284.000000 |
| max | 4.000000 | 1.000000 | 1.000000 | 4.000000 | 41.00000 | 45.455000 | 100.000000 | 56.996900 | 367.000000 | 886.000000 | 977.000000 |
train.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 |
%matplotlib inline
import matplotlib.pyplot as plt
train.hist(bins=50, figsize=(20,15))
plt.show()
In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. Data is stored in tabular format. In particular, it offers data structures and operations for manipulating numerical tables and time series.
The Panda Library features include:
For more info on Pandas dataframes see the following:
E.g.,
train[['count', 'holiday']].groupby(['holiday'], as_index = True).mean().sort_values(by = 'count')
| count | |
|---|---|
| holiday | |
| 1 | 185.877814 |
| 0 | 191.741655 |
train[['count', 'season']].groupby(['season'], as_index = True).mean().sort_values(by = 'count')
| count | |
|---|---|
| season | |
| 1 | 116.343261 |
| 4 | 198.988296 |
| 2 | 215.251372 |
| 3 | 234.417124 |
weather: Four Categories of weather
1-> Clear, Few clouds, Partly cloudy, Partly cloudy
2-> Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3-> Light Snow and Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4-> Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
train.season.unique()
array([1, 2, 3, 4])
train.weather.value_counts()
1 7192 2 2834 3 859 4 1 Name: weather, dtype: int64
train.holiday.value_counts()
0 10575 1 311 Name: holiday, dtype: int64
sns.barplot(x='season', y='count', data=train)
<AxesSubplot:xlabel='season', ylabel='count'>
sns.barplot(x='weather', y='count', data=train)
<AxesSubplot:xlabel='weather', ylabel='count'>
Features like season, holiday, weather and working day are in numerical form here. Having said that the numerical values associated with feature like weather dont have any particular numeric ordering. As such, it is much better to one hot encode these features. Here we can do this manually (via code) for training and testing datasets. Later in this course we automate this process in a more principled manner (we have to deal with corner cases.
# Create a new column called df.weather1 where the value is 1
# if df.weather is 1 and 0 if not
train['weather1'] = np.where(train['weather']==1, 1, 0) #ifelse assignment
train['weather2'] = np.where(train['weather']==2, 1, 0)
train['weather3'] = np.where(train['weather']==3, 1, 0)
train['weather4'] = np.where(train['weather']==4, 1, 0)
# dont forget to drop the original feature
train.drop(['weather'], axis=1, inplace=True)
train.iloc[[7000]] ## Weather has a value of 2
#train.as_matrix()[322:330,]
| datetime | season | holiday | workingday | temp | atemp | humidity | windspeed | casual | registered | count | weather1 | weather2 | weather3 | weather4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7000 | 2012-04-10 00:00:00 | 2 | 0 | 1 | 18.04 | 21.97 | 47 | 19.0012 | 3 | 23 | 26 | 0 | 1 | 0 | 0 |
Dont forget to OHE the test data!! ..........
Hint: Season? Others?
train["hour"] = [t.hour for t in pd.DatetimeIndex(train.datetime)]
train["day"] = [t.dayofweek for t in pd.DatetimeIndex(train.datetime)]
train["month"] = [t.month for t in pd.DatetimeIndex(train.datetime)]
train['year'] = [t.year for t in pd.DatetimeIndex(train.datetime)]
train['year'] = train['year'].map({2011:0, 2012:1}) #OHE: One-hot-encoding (OHE): based
Why separate input features and target feature for TRAIN only?
Answer: The test set is for creating a Kaggle submission.
X, y = train.iloc[:, 1:], train['count']
X = train.drop(['datetime','registered', 'casual', 'count'], axis=1)
X.iloc[0:2,:]
| season | holiday | workingday | temp | atemp | humidity | windspeed | weather1 | weather2 | weather3 | weather4 | hour | day | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 9.84 | 14.395 | 81 | 0.0 | 1 | 0 | 0 | 0 | 0 | 5 | 1 | 0 |
| 1 | 1 | 0 | 0 | 9.02 | 13.635 | 80 | 0.0 | 1 | 0 | 0 | 0 | 1 | 5 | 1 | 0 |
It was seen from the training data sum of registered column and casual column yields count. It was unnecessary to keep these two columns as our features, Machine learning learners can be more fruitful if dataset is free of useless columns.
plt.scatter(x = train['casual'] + train['registered'], y = train['count'])
plt.show()
Split the training data into train and a blind test set using scikit's train_test_split package. Rememeber that the test set that was provided is for Kaggle submission only and as such has not target values.
np.random.seed(42)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
Features on larger scales can unduly influence the model. We want features on a similar scale. Scikit's preprocessing provides us with StandardScaler package to scale our data.
from sklearn.preprocessing import StandardScaler
scl= StandardScaler()
X_train_std = scl.fit_transform(X_train)
X_test_std = scl.transform(X_test)
lin_reg = LinearRegression()
lin_reg.fit(X_train_std, y_train) #SKLearn's linear regression
y_train_pred = lin_reg.predict(X_train_std)
y_test_pred = lin_reg.predict(X_test_std)
from sklearn.metrics import mean_squared_error, r2_score
#Root_Mean_Square Error(RMSE) is performance criteria for this problem
# errors on count
print('RMSE train: %.3f' % np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('RMSE test: %.3f' % np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('R2 train: %.3f' % r2_score(y_train, y_train_pred))
print('R2 test: %.3f' % r2_score(y_test, y_test_pred))
RMSE train: 141.620 RMSE test: 140.759 R2 train: 0.391 R2 test: 0.390
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
t = np.arange(0.01, 20.0, 0.1)
# log x and y axis
#plt.subplot(223)
plt.loglog(t, 20*np.exp(-t/10.0))# basex=2)
plt.grid(True)
plt.title('loglog base 2 on x')
Text(0.5, 1.0, 'loglog base 2 on x')
expLog = pd.DataFrame(columns=["Model description", "RMSE train", 'RMSE test','R2 test'])
# add the result of this experiment to the log book
expLog.loc[len(expLog)] = ["SKLearn Linear Regression vanilla",
np.round(np.sqrt(mean_squared_error(y_train, y_train_pred)),3),
np.round(np.sqrt(mean_squared_error(y_test, y_test_pred)),3),
np.round(r2_score(y_test, y_test_pred), 3)
]
expLog
| Model description | RMSE train | RMSE test | R2 test | |
|---|---|---|---|---|
| 0 | SKLearn Linear Regression vanilla | 141.62 | 140.759 | 0.39 |
Let's learn a DecisionTreeRegressor model for comparison purposes.
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
dt.fit(X_train_std, y_train)
y_train_pred = dt.predict(X_train_std)
y_test_pred = dt.predict(X_test_std)
#Root_Mean_Square_Error(RMSE) is accuracy criteria for this problem
# errors on count
print('RMSE train: %.3f' % np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('RMSE test: %.3f' % np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('R2 train: %.3f' % r2_score(y_train, y_train_pred))
print('R2 test: %.3f' % r2_score(y_test, y_test_pred))
expLog.loc[len(expLog)] = ["SKLearn DecisionTree vanilla",
np.round(np.sqrt(mean_squared_error(y_train, y_train_pred)),3),
np.round(np.sqrt(mean_squared_error(y_test, y_test_pred)),3),
np.round(r2_score(y_test, y_test_pred), 3)
]
expLog
RMSE train: 0.276 RMSE test: 62.488 R2 train: 1.000 R2 test: 0.880
| Model description | RMSE train | RMSE test | R2 test | |
|---|---|---|---|---|
| 0 | SKLearn Linear Regression vanilla | 141.620 | 140.759 | 0.39 |
| 1 | SKLearn DecisionTree vanilla | 0.276 | 62.488 | 0.88 |
Learn Decision Tree ensemble, RandomForestRegressor, model.
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators = 400, criterion='mse',random_state=1, n_jobs=-1)
forest.fit(X_train_std, y_train)
y_train_pred = forest.predict(X_train_std)
y_test_pred = forest.predict(X_test_std)
from sklearn.metrics import mean_squared_error, r2_score
#Root_Mean_Square_Log_Error(RMSE) is accuracy criteria for this problem
# errors on count
print('RMSE train: %.3f' % np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('RMSE test: %.3f' % np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('R2 train: %.3f' % r2_score(y_train, y_train_pred))
print('R2 test: %.3f' % r2_score(y_test, y_test_pred))
expLog.loc[len(expLog)] = ["SKLearn RandomForestRegressor vanilla",
np.round(np.sqrt(mean_squared_error(y_train, y_train_pred)),3),
np.round(np.sqrt(mean_squared_error(y_test, y_test_pred)),3),
np.round(r2_score(y_test, y_test_pred), 3)
]
expLog
RMSE train: 14.998 RMSE test: 42.656 R2 train: 0.993 R2 test: 0.944
| Model description | RMSE train | RMSE test | R2 test | |
|---|---|---|---|---|
| 0 | SKLearn Linear Regression vanilla | 141.620 | 140.759 | 0.390 |
| 1 | SKLearn DecisionTree vanilla | 0.276 | 62.488 | 0.880 |
| 2 | SKLearn RandomForestRegressor vanilla | 14.998 | 42.656 | 0.944 |
test=pd.read_csv('./Kaggle-Bike-Sharing-Demand-Challenge/test.csv')
test.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 3 | 2011-01-20 03:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
| 4 | 2011-01-20 04:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
# Create a new column called df.weather1 where the value is 1
# if df.weather is 1 and 0 if not
test['weather1'] = np.where(test['weather']==1, 1, 0) #ifelse assignment
test['weather2'] = np.where(test['weather']==2, 1, 0)
test['weather3'] = np.where(test['weather']==3, 1, 0)
test['weather4'] = np.where(test['weather']==4, 1, 0)
# dont forget to drop the original feature
test.drop(['weather'], axis=1, inplace=True)
test["hour"] = [t.hour for t in pd.DatetimeIndex(test.datetime)]
test["day"] = [t.dayofweek for t in pd.DatetimeIndex(test.datetime)]
test["month"] = [t.month for t in pd.DatetimeIndex(test.datetime)]
test['year'] = [t.year for t in pd.DatetimeIndex(test.datetime)]
test['year'] = test['year'].map({2011:0, 2012:1})
test.head()
| datetime | season | holiday | workingday | temp | atemp | humidity | windspeed | weather1 | weather2 | weather3 | weather4 | hour | day | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 0 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 1 | 0 | 0 | 0 | 1 | 3 | 1 | 0 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 1 | 0 | 0 | 0 | 2 | 3 | 1 | 0 |
| 3 | 2011-01-20 03:00:00 | 1 | 0 | 1 | 10.66 | 12.880 | 56 | 11.0014 | 1 | 0 | 0 | 0 | 3 | 3 | 1 | 0 |
| 4 | 2011-01-20 04:00:00 | 1 | 0 | 1 | 10.66 | 12.880 | 56 | 11.0014 | 1 | 0 | 0 | 0 | 4 | 3 | 1 | 0 |
X_test=test.iloc[:,1:]
Similarly, use same standarad scaler for test data
X_test = scl.transform(X_test)
y_test=forest.predict(X_test)
df_submit = test
df_submit['count'] = np.round(y_test)
df_submit = df_submit.drop(['season', 'holiday', 'workingday','temp', 'atemp', 'humidity',
'windspeed', 'hour', 'day', 'month', 'year', 'weather1', 'weather2',
'weather3', 'weather4'], axis=1)
df_submit.head()
| datetime | count | |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 11.0 |
| 1 | 2011-01-20 01:00:00 | 6.0 |
| 2 | 2011-01-20 02:00:00 | 4.0 |
| 3 | 2011-01-20 03:00:00 | 4.0 |
| 4 | 2011-01-20 04:00:00 | 3.0 |
This is a post processing step to insure that all negative prediction values (we can not have negative bike rentals) are clipped to the value 0.
How many negative counts does your model predict? Plot a histogram using the code below.
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import math
# the histogram of the data
n, bins, patches = plt.hist(df_submit['count'], 50)
plt.xlabel('Bike rental count buckets')
plt.ylabel('Frequency of this count bucker')
plt.title(r'Histogram of bike rental counts (predicted)')
plt.grid(True)
plt.show()
df_submit['count'] = np.where(df_submit['count'] < 0, 0, df_submit['count'])
df_submit.to_csv('bike2.csv', index=False)
This question (this entire section) so far has provided a lot of background and sample code that you should review, run and extend to get familar with this problem and data set. Next up you should tackle the tasks posed here. Please feel free to adapt the above code to accomplish the following:
Overall, we want you to build a linear regression model using pipelines that builds on what was discussed earlier in this section and do a Kaggle submission for this Bike Demand prediction problem. In particular, we want you to address the following:
Derive other features
Please do feature selection using SelectKBest
train=pd.read_csv('Kaggle-Bike-Sharing-Demand-Challenge/train.csv')
train.head(3)
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 |
# Create a new column called df.season1 where the value is 1
# if df.season is 1 and 0 if not
train['season1'] = np.where(train['season']==1, 1, 0) #ifelse assignment
train['season2'] = np.where(train['season']==2, 1, 0)
train['season3'] = np.where(train['season']==3, 1, 0)
train['season4'] = np.where(train['season']==4, 1, 0)
# dont forget to drop the original feature
train.drop(['season'], axis=1, inplace=True)
# Create a new column called df.weather1 where the value is 1
# if df.weather is 1 and 0 if not
train['weather1'] = np.where(train['weather']==1, 1, 0) #ifelse assignment
train['weather2'] = np.where(train['weather']==2, 1, 0)
train['weather3'] = np.where(train['weather']==3, 1, 0)
train['weather4'] = np.where(train['weather']==4, 1, 0)
# dont forget to drop the original feature
train.drop(['weather'], axis=1, inplace=True)
train["hour"] = [t.hour for t in pd.DatetimeIndex(train.datetime)]
train["day"] = [t.dayofweek for t in pd.DatetimeIndex(train.datetime)]
train["month"] = [t.month for t in pd.DatetimeIndex(train.datetime)]
train['year'] = [t.year for t in pd.DatetimeIndex(train.datetime)]
train['year'] = train['year'].map({2011:0, 2012:1})
X_train = train.iloc[:,1:].copy()
X_train.drop(columns=['casual', 'registered', 'count'], inplace=True)
y_train = train['count'].copy()
np.random.seed(42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
print('Train (x,y):\t', X_train.shape,'\t',y_train.shape)
print('Validate (x,y):\t', X_val.shape,'\t',y_val.shape)
Train (x,y): (8708, 18) (8708,) Validate (x,y): (2178, 18) (2178,)
from sklearn.feature_selection import SelectKBest, f_regression
np.random.seed(42)
# k is an arbitrary number of features
# we will use our pipeline below to select based on p values
# use f_regression, k=5, and X_train and y_train
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
Kbest = SelectKBest(f_regression, k=5)
X_train_sel = Kbest.fit_transform(X_train, y_train)
#==================================================#
# Your code ends here #
#==================================================#
featureScores = Kbest.scores_
pd.DataFrame({'feature':X_train.axes[1], 'F statistic':Kbest.scores_,'p value':Kbest.pvalues_}).sort_values(by='p value')
| feature | F statistic | p value | |
|---|---|---|---|
| 14 | hour | 1673.869106 | 0.000000e+00 |
| 2 | temp | 1561.939217 | 2.423997e-314 |
| 3 | atemp | 1524.175960 | 2.262803e-307 |
| 4 | humidity | 979.907570 | 6.178436e-204 |
| 17 | year | 628.233428 | 6.245283e-134 |
| 6 | season1 | 498.940898 | 1.626842e-107 |
| 16 | month | 241.552239 | 9.476690e-54 |
| 8 | season3 | 149.919769 | 3.446001e-34 |
| 12 | weather3 | 123.668940 | 1.548798e-28 |
| 10 | weather1 | 103.116591 | 4.303202e-24 |
| 5 | windspeed | 85.854126 | 2.402134e-20 |
| 7 | season2 | 48.650845 | 3.281687e-12 |
| 11 | weather2 | 16.777718 | 4.240448e-05 |
| 9 | season4 | 6.069154 | 1.377548e-02 |
| 1 | workingday | 1.614551 | 2.038872e-01 |
| 0 | holiday | 0.578783 | 4.468104e-01 |
| 15 | day | 0.212617 | 6.447355e-01 |
| 13 | weather4 | 0.023224 | 8.788804e-01 |
from sklearn.base import BaseEstimator, TransformerMixin
# find the indices to the top ranking input features based on importance
# arr is an array of feature importances in order of the input data matrix X
# k is the number of features to be selected
#
# Note: this feature selector assumes that you have already computed the feature importances
# We did so in the cell above with the feature importance scores stored in Kbest.scores_
def indices_of_top_k(arr, k):
return np.sort(np.argpartition(np.array(arr), -k)[-k:])
class TopFeatureSelector(BaseEstimator, TransformerMixin):
def __init__(self, feature_importances, k):
self.feature_importances = feature_importances
self.k = k
def fit(self, X, y=None):
self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
return self
# select the columns of data matrix of the selected input features
def transform(self, X):
return X[:, self.feature_indices_]
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
linreg = LinearRegression()
pipe = Pipeline([
("MinMaxScaler", MinMaxScaler()),
('feature_selection', TopFeatureSelector(featureScores, 13)), # k=13 based on p values above
('linreg', LinearRegression())
])
pipe.fit(X_train, y_train)
y_val_pred = pipe.predict(X_val)
plt.figure(figsize=(8,3))
plt.subplot(121)
plt.hist(y_val_pred)
plt.title('Validation Set Predictions')
plt.subplot(122)
plt.hist(y_val)
plt.title('Validation Set Actual');
# set a lower bound for predicted values at zero
# and set all predictions to integer (count)
y_val_pred = np.where(y_val_pred < 0, 0, y_val_pred.astype(int))
That's a little more realistic, but still doesn't appear to be distributed like our ground truth.
plt.figure(figsize=(8,3))
plt.subplot(121)
plt.hist(y_val_pred)
plt.title('Validation Set Predictions')
print(y_val_pred[:10])
[230 37 196 280 396 316 194 285 181 29]
Generate statistics on the predictions.
expLog = pd.DataFrame(columns=['Model Description', 'MAE', 'RMSE', "MAPE"])
mae = np.round(mean_absolute_error(y_val, y_val_pred), 3)
rmse = np.round(np.sqrt(mean_squared_error(y_val, y_val_pred)), 3)
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
mape = np.round(mean_absolute_percentage_error(y_val, y_val_pred), 3)
#==================================================#
# Your code ends here #
#==================================================#
expLog.loc[len(expLog)] = ['SKLearn LinReg SelectKBest=13', mae, rmse, mape]
expLog
| Model Description | MAE | RMSE | MAPE | |
|---|---|---|---|---|
| 0 | SKLearn LinReg SelectKBest=13 | 102.898 | 139.806 | 295.101 |
test=pd.read_csv('./Kaggle-Bike-Sharing-Demand-Challenge/test.csv')
# Create a new column called df.weather1 where the value is 1
# if df.weather is 1 and 0 if not
test['weather1'] = np.where(test['weather']==1, 1, 0) #ifelse assignment
test['weather2'] = np.where(test['weather']==2, 1, 0)
test['weather3'] = np.where(test['weather']==3, 1, 0)
test['weather4'] = np.where(test['weather']==4, 1, 0)
# dont forget to drop the original feature
test.drop(['weather'], axis=1, inplace=True)
# Create a new column called df.season1 where the value is 1
# if df.season is 1 and 0 if not
test['season1'] = np.where(test['season']==1, 1, 0) #ifelse assignment
test['season2'] = np.where(test['season']==2, 1, 0)
test['season3'] = np.where(test['season']==3, 1, 0)
test['season4'] = np.where(test['season']==4, 1, 0)
# dont forget to drop the original feature
test.drop(['season'], axis=1, inplace=True)
test["hour"] = [t.hour for t in pd.DatetimeIndex(test.datetime)]
test["day"] = [t.dayofweek for t in pd.DatetimeIndex(test.datetime)]
test["month"] = [t.month for t in pd.DatetimeIndex(test.datetime)]
test['year'] = [t.year for t in pd.DatetimeIndex(test.datetime)]
test['year'] = test['year'].map({2011:0, 2012:1})
X_test = test.iloc[:,1:].copy()
# generate predictions and apply the same post-processing
y_test = pipe.predict(X_test)
y_test = np.where(y_test < 0, 0, y_test.astype(int))
# create dataframe in the form expected by Kaggle
df_submit = pd.DataFrame({'datetime':test['datetime'],'count':y_test})
df_submit = df_submit[['datetime','count']]
df_submit.head(10)
| datetime | count | |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
| 5 | 2011-01-20 05:00:00 | 0 |
| 6 | 2011-01-20 06:00:00 | 0 |
| 7 | 2011-01-20 07:00:00 | 2 |
| 8 | 2011-01-20 08:00:00 | 11 |
| 9 | 2011-01-20 09:00:00 | 24 |
df_submit.to_csv('bike3.csv', index=False)
Build separate linear regression models for the other target variables (instead of the count target variable) and report your results:
registered: number of registered user
casual: number of non-registered user
Overall, we want you to build a linear regression model using pipelines that builds on what you learned as a solution to the previous section, and do a Kaggle submission for this Bike Demand prediction problem. In particular, we want you to address (using what you have done previously) the following:
SelectKBest train=pd.read_csv('Kaggle-Bike-Sharing-Demand-Challenge/train.csv')
train.head(1)
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
train['weather1'] = np.where(train['weather']==1, 1, 0) #ifelse assignment
train['weather2'] = np.where(train['weather']==2, 1, 0)
train['weather3'] = np.where(train['weather']==3, 1, 0)
train['weather4'] = np.where(train['weather']==4, 1, 0)
train['season1'] = np.where(train['season']==1, 1, 0) #ifelse assignment
train['season2'] = np.where(train['season']==2, 1, 0)
train['season3'] = np.where(train['season']==3, 1, 0)
train['season4'] = np.where(train['season']==4, 1, 0)
train["hour"] = [t.hour for t in pd.DatetimeIndex(train.datetime)]
train["day"] = [t.dayofweek for t in pd.DatetimeIndex(train.datetime)]
train["month"] = [t.month for t in pd.DatetimeIndex(train.datetime)]
train['year'] = [t.year for t in pd.DatetimeIndex(train.datetime)]
train['year'] = train['year'].map({2011:0, 2012:1}) #OHE: One-hot-encoding (OHE): based
# dont forget to drop the original feature
train.drop(['season', 'weather', 'datetime'], axis=1, inplace=True)
# define X, y_reg, and y_cas from train
# Hint: you may want to drop some columns from train for X
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
X =train.drop(['registered', 'casual', 'count'], axis=1)
y_reg, y_cas = train['registered'], train['casual']
#==================================================#
# Your code ends here #
#==================================================#
np.random.seed(42)
# We can ensure the same split for both targets by setting the same random_state for both splits
X_train, X_val, y_train_reg, y_val_reg = train_test_split(X, y_reg, test_size=.3, random_state=0)
_, _, y_train_cas, y_val_cas = train_test_split(X, y_cas, test_size=.3, random_state=0)
print('Train (x,y):\t', X_train.shape,'\t',y_train_reg.shape,'\t',y_train_cas.shape)
print('Validate (x,y):\t', X_val.shape,'\t',y_val_reg.shape,'\t',y_val_cas.shape)
Train (x,y): (7620, 18) (7620,) (7620,) Validate (x,y): (3266, 18) (3266,) (3266,)
from sklearn.feature_selection import SelectKBest, f_regression
np.random.seed(42)
Kbest_REG = SelectKBest(f_regression, k=7)
Kbest_CAS = SelectKBest(f_regression, k=7)
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
Kbest_REG = Kbest_REG.fit(X_train, y_train_reg)
Kbest_CAS = Kbest_CAS.fit(X_train, y_train_cas)
#==================================================#
# Your code ends here #
#==================================================#
Because we used a pipeline, we can easily generate predictions on two targets using the same model
df_reg = pd.DataFrame({'feature (reg)':X_train.axes[1],
'F statistic (reg)':Kbest_REG.scores_,
'p value (reg)':Kbest_REG.pvalues_}).sort_values(by='p value (reg)')
df_reg
| feature (reg) | F statistic (reg) | p value (reg) | |
|---|---|---|---|
| 14 | hour | 1296.800472 | 2.169452e-262 |
| 2 | temp | 843.176752 | 6.452129e-176 |
| 3 | atemp | 813.639983 | 3.994344e-170 |
| 4 | humidity | 570.319498 | 1.294545e-121 |
| 17 | year | 543.091555 | 4.283623e-116 |
| 10 | season1 | 333.847849 | 4.983973e-73 |
| 16 | month | 237.504136 | 8.574231e-53 |
| 1 | workingday | 115.906246 | 7.768005e-27 |
| 8 | weather3 | 90.353034 | 2.614341e-21 |
| 12 | season3 | 71.494595 | 3.299795e-17 |
| 15 | day | 71.362797 | 3.525552e-17 |
| 6 | weather1 | 62.484946 | 3.062185e-15 |
| 5 | windspeed | 61.252272 | 5.697823e-15 |
| 13 | season4 | 32.086972 | 1.527911e-08 |
| 11 | season2 | 14.470577 | 1.434795e-04 |
| 7 | weather2 | 7.411146 | 6.496849e-03 |
| 0 | holiday | 2.416684 | 1.200904e-01 |
| 9 | weather4 | 0.000480 | 9.825140e-01 |
Last two features have p values greater than 0.01, so they will be excluded
reg_features_used = 16 # based on features witl p value of less than 0.01
featureScores = Kbest_REG.scores_
pipe_reg = Pipeline([
("MinMaxScaler", MinMaxScaler()),
('feature_selection', TopFeatureSelector(featureScores, reg_features_used)),
('linreg', LinearRegression())
])
pipe_reg.fit(X_train, y_train_reg)
y_pred_reg = pipe_reg.predict(X_val)
y_pred_reg = np.where(y_pred_reg < 0, 0, np.round(y_pred_reg).astype(int))
df_cas = pd.DataFrame({'feature (cas)':X_train.axes[1],
'F statistic (cas)':Kbest_CAS.scores_,
'p value (cas)':Kbest_CAS.pvalues_}).sort_values(by='p value (cas)')
df_cas
| feature (cas) | F statistic (cas) | p value (cas) | |
|---|---|---|---|
| 2 | temp | 2185.595067 | 0.000000e+00 |
| 3 | atemp | 2120.362611 | 0.000000e+00 |
| 4 | humidity | 1014.044057 | 5.023065e-209 |
| 1 | workingday | 841.617133 | 1.303198e-175 |
| 14 | hour | 780.018610 | 1.656453e-163 |
| 10 | season1 | 471.600445 | 1.641715e-101 |
| 15 | day | 451.598301 | 2.086401e-97 |
| 12 | season3 | 283.707495 | 1.566993e-62 |
| 11 | season2 | 149.212608 | 5.351810e-34 |
| 17 | year | 146.506982 | 2.035539e-33 |
| 6 | weather1 | 90.701218 | 2.197026e-21 |
| 8 | weather3 | 78.348896 | 1.060236e-18 |
| 16 | month | 69.897154 | 7.360533e-17 |
| 5 | windspeed | 64.076278 | 1.374261e-15 |
| 13 | season4 | 58.339405 | 2.474424e-14 |
| 7 | weather2 | 23.297184 | 1.415049e-06 |
| 0 | holiday | 16.606019 | 4.647193e-05 |
| 9 | weather4 | 0.355766 | 5.508848e-01 |
cas_features_used = 17 # based on features with p value of less than 0.01; please verify
featureScores = Kbest_CAS.scores_
pipe_cas = Pipeline([
("MinMaxScaler", MinMaxScaler()),
('feature_selection', TopFeatureSelector(featureScores, cas_features_used)),
('linreg', LinearRegression())
])
pipe_cas.fit(X_train, y_train_cas)
y_pred_cas = pipe_cas.predict(X_val)
y_pred_cas = np.where(y_pred_cas < 0, 0, np.round(y_pred_cas).astype(int))
### Visualize results
fig, ax = plt.subplots(1,2,sharey=True)
fig.set_figwidth(8)
ax[0].hist(y_pred_reg, bins=30)
ax[0].set_title('Predictions for registered users')
ax[1].hist(y_pred_cas, bins=30)
ax[1].set_title('Predictions for casual users');
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
rmse = np.round(np.sqrt(mean_squared_error(y_val_reg, y_pred_reg)), 3)
#==================================================#
# Your code ends here #
#==================================================#
mae = np.round(mean_absolute_error(y_val_reg, y_pred_reg), 3)
mape = np.round(mean_absolute_percentage_error(y_val_reg, y_pred_reg), 3)
expLog.loc[len(expLog)] = ['SKLearn LinReg for "registered"', mae, rmse, mape]
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
rmse = np.round(np.sqrt(mean_squared_error(y_val_cas, y_pred_cas)), 3)
#==================================================#
# Your code ends here #
#==================================================#
mae = np.round(mean_absolute_error(y_val_cas, y_pred_cas), 3)
mape = np.round(mean_absolute_percentage_error(y_val_cas, y_pred_cas), 3)
expLog.loc[len(expLog)] = ['SKLearn LinReg for "casual"', mae, rmse, mape]
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:12: RuntimeWarning: divide by zero encountered in true_divide if sys.path[0] == '': /usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:12: RuntimeWarning: invalid value encountered in true_divide if sys.path[0] == '': /usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:12: RuntimeWarning: divide by zero encountered in true_divide if sys.path[0] == '': /usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:12: RuntimeWarning: invalid value encountered in true_divide if sys.path[0] == '':
MAPE is undefined for registered and casual models because each includes zeros among the actual (true) target values.
expLog
| Model Description | MAE | RMSE | MAPE | |
|---|---|---|---|---|
| 0 | SKLearn LinReg SelectKBest=13 | 102.898 | 139.806 | 295.101 |
| 1 | SKLearn LinReg for "registered" | 87.692 | 121.494 | NaN |
| 2 | SKLearn LinReg for "casual" | 22.265 | 35.378 | NaN |
The distribution of the raw count feature is very skewed with most values occuring to the left of the count domain. To make this distribution less skewed (and more normal distributed, it is common to take the log of the value + 1 (to avoid log(0)).
Overall, we want you to build a linear regression model (single model will suffice based on the count target) using pipelines that builds on what you learned as a solution to the previous section, and do a Kaggle submission for this Bike Demand prediction problem. In particular, we want you to address (using what you have done previously) the following:
Derive other features
If barplot demand over each hour of the the day, you will notice interesting patterns. E.g., you might be able to segregate the bike demand in three hourly categories, like the following:
High : 7-9 and 17-19 hours
Average : 10-16 hours
Low : 0-6 and 20-24 hours
Sequential Backward Selection algorithmtrain=pd.read_csv('Kaggle-Bike-Sharing-Demand-Challenge/train.csv')
train.head(1)
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 |
train['weather1'] = np.where(train['weather']==1, 1, 0) #ifelse assignment
train['weather2'] = np.where(train['weather']==2, 1, 0)
train['weather3'] = np.where(train['weather']==3, 1, 0)
train['weather4'] = np.where(train['weather']==4, 1, 0)
train['season1'] = np.where(train['season']==1, 1, 0) #ifelse assignment
train['season2'] = np.where(train['season']==2, 1, 0)
train['season3'] = np.where(train['season']==3, 1, 0)
train['season4'] = np.where(train['season']==4, 1, 0)
train["hour"] = [t.hour for t in pd.DatetimeIndex(train.datetime)]
train["day"] = [t.dayofweek for t in pd.DatetimeIndex(train.datetime)]
train["month"] = [t.month for t in pd.DatetimeIndex(train.datetime)]
train['year'] = [t.year for t in pd.DatetimeIndex(train.datetime)]
train['year'] = train['year'].map({2011:0, 2012:1}) #OHE: One-hot-encoding (OHE): based
# dont forget to drop the original feature
train.drop(['season', 'weather'], axis=1, inplace=True)
Separate x, y from the train data and log transform y
X, y = train.iloc[:, 1:], train['count']
X = X.drop(['registered', 'casual', 'count'], axis=1)
# transform y
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
y = np.log(y+1)
#==================================================#
# Your code ends here #
#==================================================#
Note that zeros no longer dominate other target values in transformed data
### Visualize target transformation
fig, ax = plt.subplots(1,2,sharey=True)
fig.set_figwidth(8)
ax[0].hist(train['count'], bins=30)
ax[0].set_title('Original targets')
ax[1].hist(y, bins=30)
ax[1].set_title('Log transformed targets');
np.random.seed(42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.2, random_state=0)
from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
class SBS():
def __init__(self, estimator, k_features, scoring=accuracy_score,
test_size=0.25, random_state=1):
self.scoring = scoring
self.estimator = clone(estimator)
self.k_features = k_features
self.test_size = test_size
self.random_state = random_state
def fit(self, X, y):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_size,
random_state=self.random_state)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
dim = X_train.shape[1]
self.indices_ = list(range(dim))
self.subsets_ = [self.indices_]
score = self._calc_score(X_train, y_train,
X_test, y_test, self.indices_)
self.scores_ = [score]
while dim > self.k_features:
scores = []
subsets = []
for p in combinations(self.indices_, r=dim - 1):
score = self._calc_score(X_train, y_train,
X_test, y_test, list(p))
scores.append(score)
subsets.append(p)
best = np.argmax(scores)
self.indices_ = subsets[best]
self.subsets_.append(self.indices_)
dim -= 1
self.scores_.append(scores[best])
self.k_score_ = self.scores_[-1]
return self
def transform(self, X):
return X[:, self.indices_]
def _calc_score(self, X_train, y_train, X_test, y_test, indices):
self.estimator.fit(X_train.iloc[:, indices], y_train)
y_pred = self.estimator.predict(X_test.iloc[:, indices])
score = self.scoring(y_test, y_pred)
return score
from sklearn.metrics import explained_variance_score
np.random.seed(42)
# set up SBS: use linear regression estimator with default parameters
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
lr =LinearRegression()
sbs = SBS(lr, 1, explained_variance_score)
#==================================================#
# Your code ends here #
#==================================================#
sbs.fit(X_train, y_train)
(6531, 18) (6531,) (2177, 18) (2177,)
<__main__.SBS at 0x7f891d030810>
Beyond 5 features there is not much additional benefit in terms of explained variance.
# plotting performance of feature subsets
k_feat = [len(k) for k in sbs.subsets_]
#plot places the points on the graph using the number of features (no need to sort!)
plt.plot(k_feat, sbs.scores_, marker='o')
plt.xticks(k_feat)
plt.ylim([0.0, 1.02])
plt.title("Sequential Backward Selection")
plt.ylabel('explained_variance_score on validation set')
plt.xlabel(r'number of features')
Text(0.5, 0, 'number of features')
Modify TopFeatureSelector to select the k-th subset our SBS object
class SBSFeatureSelector(BaseEstimator, TransformerMixin):
def __init__(self, subsets, k):
self.subsets = subsets
self.k = k
def fit(self, X, y=None):
self.feature_indices_ = sbs.subsets_[len(self.subsets)-self.k] # pull the k-th subset from the end
return self
# select the columns of data matrix of the selected input features
def transform(self, X):
return X[:, self.feature_indices_]
np.random.seed(42)
# set up the pipeline with scaler, SBS selector, and estimator
# use k=5
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
lr = LinearRegression()
pipe = Pipeline((
("std_scaler", StandardScaler()),
("SBS_selector", SBSFeatureSelector(sbs.subsets_, k=5)),
("lin_reg",lr),
))
#==================================================#
# Your code ends here #
#==================================================#
pipe.fit(X_train, y_train)
Pipeline(steps=[('std_scaler', StandardScaler()),
('SBS_selector',
SBSFeatureSelector(k=5,
subsets=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17],
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17),
(2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17),
(2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 16, 17),
(2, 3, 4, 5, 6, 7, 8, 9, 11, 12,
13, 14, 16, 17),
(2, 3, 4, 5, 6, 7, 8, 11, 12, 13,
14, 16, 17),
(2, 3, 4, 5, 6, 8, 11, 12, 13, 14,
16, 17),
(2, 3, 4, 5, 6, 8, 12, 13, 14, 16,
17),
(2, 3, 4, 5, 6, 8, 12, 14, 16, 17),
(2, 4, 5, 6, 8, 12, 14, 16, 17),
(2, 4, 6, 8, 12, 14, 16, 17),
(2, 4, 6, 8, 14, 16, 17),
(2, 4, 8, 14, 16, 17),
(2, 4, 14, 16, 17), (2, 4, 14, 17),
(2, 4, 14), (2, 14), (14,)])),
('lin_reg', LinearRegression())])
y_val_pred = pipe.predict(X_val)
mae = np.round(mean_absolute_error(y_val, y_val_pred), 3)
rmse = np.round(np.sqrt(mean_squared_error(y_val, y_val_pred)), 3)
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
mape = np.round(mean_absolute_percentage_error(y_val, y_val_pred), 3)
#==================================================#
# Your code ends here #
#==================================================#
expLog.loc[len(expLog)] = ['SKLearn LinReg SBS=5 Log Targets', mae, rmse, mape]
expLog
| Model Description | MAE | RMSE | MAPE | |
|---|---|---|---|---|
| 0 | SKLearn LinReg SelectKBest=13 | 102.898 | 139.806 | 295.101 |
| 1 | SKLearn LinReg for "registered" | 87.692 | 121.494 | NaN |
| 2 | SKLearn LinReg for "casual" | 22.265 | 35.378 | NaN |
| 3 | SKLearn LinReg SBS=5 Log Targets | 0.816 | 1.030 | 27.507 |
test=pd.read_csv('./Kaggle-Bike-Sharing-Demand-Challenge/test.csv')
# Create a new column called df.weather1 where the value is 1
# if df.weather is 1 and 0 if not
test['weather1'] = np.where(test['weather']==1, 1, 0) #ifelse assignment
test['weather2'] = np.where(test['weather']==2, 1, 0)
test['weather3'] = np.where(test['weather']==3, 1, 0)
test['weather4'] = np.where(test['weather']==4, 1, 0)
# dont forget to drop the original feature
test.drop(['weather'], axis=1, inplace=True)
# Create a new column called df.season1 where the value is 1
# if df.season is 1 and 0 if not
test['season1'] = np.where(test['season']==1, 1, 0) #ifelse assignment
test['season2'] = np.where(test['season']==2, 1, 0)
test['season3'] = np.where(test['season']==3, 1, 0)
test['season4'] = np.where(test['season']==4, 1, 0)
# dont forget to drop the original feature
test.drop(['season'], axis=1, inplace=True)
test["hour"] = [t.hour for t in pd.DatetimeIndex(test.datetime)]
test["day"] = [t.dayofweek for t in pd.DatetimeIndex(test.datetime)]
test["month"] = [t.month for t in pd.DatetimeIndex(test.datetime)]
test['year'] = [t.year for t in pd.DatetimeIndex(test.datetime)]
test['year'] = test['year'].map({2011:0, 2012:1})
X_test = test.iloc[:,1:]
y_pred_log = pipe.predict(X_test)
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
temp=np.exp(y_pred_log)-1
y_pred =np.round(temp).astype(int)
#==================================================#
# Your code ends here #
#==================================================#
# Compare original target distribution with log transformed distribution
plt.figure(figsize=(10,4))
plt.subplot(121)
n, bins, patches = plt.hist(y_pred, 50)
plt.xlabel('Bike rental count buckets')
plt.ylabel('Frequency of this count bucker')
plt.title(r'Histogram of bike rental counts (final predictions)')
plt.grid(True)
plt.subplot(122)
n, bins, patches = plt.hist(y_pred_log, 50)
plt.xlabel('Bike rental count buckets')
plt.ylabel('Frequency of this count bucker')
plt.title(r'Histogram of bike rental counts (log_transformed predictions)')
plt.grid(True)
plt.show()
# create dataframe in the form expected by Kaggle
df_submit = pd.DataFrame({'datetime':test['datetime'],'count':y_pred})
df_submit = df_submit[['datetime','count']]
# Hint: use head()
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
df_submit.head(10)
#==================================================#
# Your code ends here #
#==================================================#
| datetime | count | |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 12 |
| 1 | 2011-01-20 01:00:00 | 13 |
| 2 | 2011-01-20 02:00:00 | 15 |
| 3 | 2011-01-20 03:00:00 | 16 |
| 4 | 2011-01-20 04:00:00 | 18 |
| 5 | 2011-01-20 05:00:00 | 18 |
| 6 | 2011-01-20 06:00:00 | 20 |
| 7 | 2011-01-20 07:00:00 | 23 |
| 8 | 2011-01-20 08:00:00 | 26 |
| 9 | 2011-01-20 09:00:00 | 31 |
df_submit.to_csv('bike4.csv', index=False)
Regularization is a way of penalizing the model for excessive complexity, and this helps reduce the risk of overfitting.
There are many ways of doing regularization but these two are the major ones:
Use SKLearn Ridge and LASSO linear regression classes to perform feature selection. Use gridsearch to determine best value for $\alpha$ the mixing coefficent associated with the penalty term in regularized linear regression.
train=pd.read_csv('Kaggle-Bike-Sharing-Demand-Challenge/train.csv')
train['weather1'] = np.where(train['weather']==1, 1, 0) #ifelse assignment
train['weather2'] = np.where(train['weather']==2, 1, 0)
train['weather3'] = np.where(train['weather']==3, 1, 0)
train['weather4'] = np.where(train['weather']==4, 1, 0)
train['season1'] = np.where(train['season']==1, 1, 0) #ifelse assignment
train['season2'] = np.where(train['season']==2, 1, 0)
train['season3'] = np.where(train['season']==3, 1, 0)
train['season4'] = np.where(train['season']==4, 1, 0)
train["hour"] = [t.hour for t in pd.DatetimeIndex(train.datetime)]
train["day"] = [t.dayofweek for t in pd.DatetimeIndex(train.datetime)]
train["month"] = [t.month for t in pd.DatetimeIndex(train.datetime)]
train['year'] = [t.year for t in pd.DatetimeIndex(train.datetime)]
train['year'] = train['year'].map({2011:0, 2012:1}) #OHE: One-hot-encoding (OHE): based
# dont forget to drop the original feature
train.drop(['season', 'weather'], axis=1, inplace=True)
X, y = train.iloc[:, 1:], train['count']
X = X.drop(['registered', 'casual', 'count'], axis=1)
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
np.random.seed(42)
estimators = [('ridge', Ridge()),
('lasso', Lasso())]
best_score = []
best_param = []
for estimator in estimators:
params = {estimator[0]+'__alpha':[.01, .05, .1, .5, 1, 5]}
# set up the pipeline using the standard scaler and estimator
# and grid search with pipeline, params,
# and correct scoring parameter (scoring parameter has to be a utility- where bigger is better
# such as neg_mean_squared_error, explained_variance etc. )
# for more information see the following -
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
# Please use neg_mean_squared_error here.
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
pipe = Pipeline([('scalar',StandardScaler()), estimator])
gs =GridSearchCV(pipe, params, cv=5, scoring='neg_mean_squared_error')
#==================================================#
# Your code ends here #
#==================================================#
gs.fit(X_train, y_train)
best_score.append(gs.best_score_)
best_param.append(gs.best_params_)
best_idx = np.argmax(best_score)
print('Best model is:', estimators[best_idx][0], 'with parameter', best_param[best_idx])
Best model is: lasso with parameter {'lasso__alpha': 0.5}
list(best_param[best_idx].values())[0]
0.5
#estimator = estimators[best_idx][0], list(best_param[best_idx].values())[0])
estimator = [Ridge, Lasso][best_idx]
param = list(gs.best_params_.values())[0]
print (param)
# set up the pipeline using the best estimator
pipe = Pipeline([('scalar', StandardScaler()), ('estimator', estimator(alpha=param))])
pipe.fit(X_train, y_train)
0.5
Pipeline(steps=[('scalar', StandardScaler()), ('estimator', Lasso(alpha=0.5))])
y_pred = pipe.predict(X_test)
mae = np.round(mean_absolute_error(y_pred, y_test), 3)
rmse = np.round(np.sqrt(mean_squared_error(y_pred, y_test)), 3)
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
mape = np.round(mean_absolute_percentage_error(y_test, y_pred), 3)
#==================================================#
# Your code ends here #
#==================================================#
expLog.loc[len(expLog)] = ['SKLearn Linear Regression GS (best regularization and alpha)', mae, rmse, mape]
expLog
| Model Description | MAE | RMSE | MAPE | |
|---|---|---|---|---|
| 0 | SKLearn LinReg SelectKBest=13 | 102.898 | 139.806 | 295.101 |
| 1 | SKLearn LinReg for "registered" | 87.692 | 121.494 | NaN |
| 2 | SKLearn LinReg for "casual" | 22.265 | 35.378 | NaN |
| 3 | SKLearn LinReg SBS=5 Log Targets | 0.816 | 1.030 | 27.507 |
| 4 | SKLearn Linear Regression GS (best regularizat... | 105.472 | 140.201 | 354.104 |
test=pd.read_csv('./Kaggle-Bike-Sharing-Demand-Challenge/test.csv')
# Create a new column called df.weather1 where the value is 1
# if df.weather is 1 and 0 if not
test['weather1'] = np.where(test['weather']==1, 1, 0) #ifelse assignment
test['weather2'] = np.where(test['weather']==2, 1, 0)
test['weather3'] = np.where(test['weather']==3, 1, 0)
test['weather4'] = np.where(test['weather']==4, 1, 0)
# dont forget to drop the original feature
test.drop(['weather'], axis=1, inplace=True)
# Create a new column called df.season1 where the value is 1
# if df.season is 1 and 0 if not
test['season1'] = np.where(test['season']==1, 1, 0) #ifelse assignment
test['season2'] = np.where(test['season']==2, 1, 0)
test['season3'] = np.where(test['season']==3, 1, 0)
test['season4'] = np.where(test['season']==4, 1, 0)
# dont forget to drop the original feature
test.drop(['season'], axis=1, inplace=True)
test["hour"] = [t.hour for t in pd.DatetimeIndex(test.datetime)]
test["day"] = [t.dayofweek for t in pd.DatetimeIndex(test.datetime)]
test["month"] = [t.month for t in pd.DatetimeIndex(test.datetime)]
test['year'] = [t.year for t in pd.DatetimeIndex(test.datetime)]
test['year'] = test['year'].map({2011:0, 2012:1})
X_test = test.iloc[:,1:]
y_pred = pipe.predict(X_test)
y_pred = np.where(y_pred < 0, 0, np.round(y_pred).astype(int))
plt.hist(y_pred, bins=30);
df_submit = pd.DataFrame({'datetime':test['datetime'],'count':y_pred})
df_submit = df_submit[['datetime','count']]
df_submit.head(10)
| datetime | count | |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
| 5 | 2011-01-20 05:00:00 | 0 |
| 6 | 2011-01-20 06:00:00 | 0 |
| 7 | 2011-01-20 07:00:00 | 6 |
| 8 | 2011-01-20 08:00:00 | 14 |
| 9 | 2011-01-20 09:00:00 | 39 |
Why didn't we use feature selection for this problem?
print(f"The winning pipeline:\n {pipe}")
The winning pipeline:
Pipeline(steps=[('scalar', StandardScaler()), ('estimator', Lasso(alpha=0.5))])
Get access to the internals of the pipeline using the following:
pipe.named_steps.estimator.coef_
# note that Lasso regularization zeroed out several of the features
print(f"note that Lasso regularization zeroed out several of the features :\n{pipe.named_steps.estimator.coef_}")
note that Lasso regularization zeroed out several of the features : [ -0.86851593 0.7986806 49.30631281 14.57379078 -37.02175395 1.66579222 -0. 2.47906541 -8.54516982 0.18550432 -0. 0.13640983 -18.61751468 0. 53.581956 0. 27.22198062 39.3896986 ]
This section provides all the data and code. Please run all cells in this section and respond to the questions included here. Feel free to modify the code once you have completed the tasks, amd experiment with new settings.
Predict diabetes progression one year after baseline using patient information and tests such as:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.3)
matplotlib.rcParams["legend.framealpha"] = 1
matplotlib.rcParams["legend.frameon"] = True
##
diabetes = load_diabetes()
print(diabetes.DESCR)
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
**Data Set Characteristics:**
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
Perform EDA through visualizing the data and also do other statistical analysis.
Note that we shift the X data to make it all positive. It doesn't affect the results, but some of the subsequent algorithms only work with positive data
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target
X.head()
| age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |
X.describe() # Let's also take a look into correlation matrix of features
| age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 | 4.420000e+02 |
| mean | -3.634285e-16 | 1.308343e-16 | -8.045349e-16 | 1.281655e-16 | -8.835316e-17 | 1.327024e-16 | -4.574646e-16 | 3.777301e-16 | -3.830854e-16 | -3.412882e-16 |
| std | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 | 4.761905e-02 |
| min | -1.072256e-01 | -4.464164e-02 | -9.027530e-02 | -1.123996e-01 | -1.267807e-01 | -1.156131e-01 | -1.023071e-01 | -7.639450e-02 | -1.260974e-01 | -1.377672e-01 |
| 25% | -3.729927e-02 | -4.464164e-02 | -3.422907e-02 | -3.665645e-02 | -3.424784e-02 | -3.035840e-02 | -3.511716e-02 | -3.949338e-02 | -3.324879e-02 | -3.317903e-02 |
| 50% | 5.383060e-03 | -4.464164e-02 | -7.283766e-03 | -5.670611e-03 | -4.320866e-03 | -3.819065e-03 | -6.584468e-03 | -2.592262e-03 | -1.947634e-03 | -1.077698e-03 |
| 75% | 3.807591e-02 | 5.068012e-02 | 3.124802e-02 | 3.564384e-02 | 2.835801e-02 | 2.984439e-02 | 2.931150e-02 | 3.430886e-02 | 3.243323e-02 | 2.791705e-02 |
| max | 1.107267e-01 | 5.068012e-02 | 1.705552e-01 | 1.320442e-01 | 1.539137e-01 | 1.987880e-01 | 1.811791e-01 | 1.852344e-01 | 1.335990e-01 | 1.356118e-01 |
Let's also take a look into correlation matrix of features
# compute the correlation matrix
X_y = pd.DataFrame(np.c_[diabetes.data, diabetes.target], columns = diabetes.feature_names + ["diabetesProg"])
# NOTE: np.c_[diabetes.data, diabetes.target] append vector to matrix
corr = X_y.corr()
corr
| age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | diabetesProg | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.000000 | 0.173737 | 0.185085 | 0.335427 | 0.260061 | 0.219243 | -0.075181 | 0.203841 | 0.270777 | 0.301731 | 0.187889 |
| sex | 0.173737 | 1.000000 | 0.088161 | 0.241013 | 0.035277 | 0.142637 | -0.379090 | 0.332115 | 0.149918 | 0.208133 | 0.043062 |
| bmi | 0.185085 | 0.088161 | 1.000000 | 0.395415 | 0.249777 | 0.261170 | -0.366811 | 0.413807 | 0.446159 | 0.388680 | 0.586450 |
| bp | 0.335427 | 0.241013 | 0.395415 | 1.000000 | 0.242470 | 0.185558 | -0.178761 | 0.257653 | 0.393478 | 0.390429 | 0.441484 |
| s1 | 0.260061 | 0.035277 | 0.249777 | 0.242470 | 1.000000 | 0.896663 | 0.051519 | 0.542207 | 0.515501 | 0.325717 | 0.212022 |
| s2 | 0.219243 | 0.142637 | 0.261170 | 0.185558 | 0.896663 | 1.000000 | -0.196455 | 0.659817 | 0.318353 | 0.290600 | 0.174054 |
| s3 | -0.075181 | -0.379090 | -0.366811 | -0.178761 | 0.051519 | -0.196455 | 1.000000 | -0.738493 | -0.398577 | -0.273697 | -0.394789 |
| s4 | 0.203841 | 0.332115 | 0.413807 | 0.257653 | 0.542207 | 0.659817 | -0.738493 | 1.000000 | 0.617857 | 0.417212 | 0.430453 |
| s5 | 0.270777 | 0.149918 | 0.446159 | 0.393478 | 0.515501 | 0.318353 | -0.398577 | 0.617857 | 1.000000 | 0.464670 | 0.565883 |
| s6 | 0.301731 | 0.208133 | 0.388680 | 0.390429 | 0.325717 | 0.290600 | -0.273697 | 0.417212 | 0.464670 | 1.000000 | 0.382483 |
| diabetesProg | 0.187889 | 0.043062 | 0.586450 | 0.441484 | 0.212022 | 0.174054 | -0.394789 | 0.430453 | 0.565883 | 0.382483 | 1.000000 |
# compute the correlation matrix
X_y = pd.DataFrame(np.c_[diabetes.data, diabetes.target], columns = diabetes.feature_names + ["diabetesProg"])
# NOTE: np.c_[diabetes.data, diabetes.target] append vector to matrix
corr = X_y.corr()
# generate a mask for the lower triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))
# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 11, as_cmap=True)
# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
square=True,
linewidths=.5, cbar_kws={"shrink": .5}, ax=ax);
pairplot()¶from sklearn import datasets
import numpy as np
diabetes = datasets.load_diabetes()
# iris = datasets.load_iris()
#X = iris.data[:, [2, 3]]
#y = iris.target
type(y)
diabetes_i_o_matrix = np.c_[diabetes.data, diabetes.target] #append vector to matrix
df = pd.DataFrame(diabetes_i_o_matrix, columns=diabetes.feature_names + ["diabetesProg"])
sns.pairplot(df, kind="reg")
df.head()
| age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | diabetesProg | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |
Split data into 70% training and 30% test data. GridSearch uses crossfold validation. Therefore we do not need a validation set.
type(X) #data stored in a numpy.ndarray
pandas.core.frame.DataFrame
from sklearn.model_selection import train_test_split
from sklearn import datasets
import numpy as np
ds = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
ds.data, ds.target, test_size=0.3, random_state=1)
print(f'X_train.shape: {X_train.shape} y_train: {len(y_test)}')
print(f'X_test.shape: {X_test.shape} y_test: {len(y_test)}')
X_train.shape: (309, 10) y_train: 133 X_test.shape: (133, 10) y_test: 133
At this point we could do some feature engineering. For example, we could create a new area feature based upon taking the product of the length and width features if a problem had such features (e.g., see the iris flower classification problem). We would view this feature as a derived feature based upon the interaction of two base input features.
Next we look at how to generate interaction features and polynomial features using SKLearn's PolynomialFeatures().
Here we explore datasets with polynomial features and interaction features.
First of all, we can generate such a dataset using sklearn, this code will demonstrate how to use it. (You can also use this link to check the user guide)
from sklearn.preprocessing import PolynomialFeatures
x = np.arange(4).reshape(2, 2)
poly = PolynomialFeatures(2)
print (x)
poly.fit_transform(x)
### outputs the following:
[[0 1]
[2 3]]
array([[1., 0., 1., 0., 0., 1.],
[1., 2., 3., 4., 6., 9.]])
Here using the original dataset of the form $(x_1,x_2)$ = [[0,1], [2,3]], in combination with PolynomialFeatures(2) yields a new dataset $(1,x_1,x_2,x_1^2,x_1x_2,x_2^2)$ = (1,0,1,0,0,1), (1,2,3,4,6,9)
The Pipeline class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps (at least!). Similarly, the last step could be regression or clustering instead of classification.
The only requirement for estimators in a pipeline is that all but the last step need to
have a transform method, so they can produce a new representation of the data that
can be used in the next step.
Internally, during the call to Pipeline.fit(), the pipeline calls fit and then transform
on each step in turn(or just fit_transform()), with the input given by the output of the transform method of the previous step. For the last step in the pipeline, just fit() is called.
def heatmap(values, xlabel, ylabel, xticklabels, yticklabels, cmap=None,
vmin=None, vmax=None, ax=None, fmt="%0.2f", title =""):
if ax is None:
ax = plt.gca()
# plot the mean cross-validation scores
img = ax.pcolor(values, cmap=cmap, vmin=vmin, vmax=vmax)
img.update_scalarmappable()
ax.set_xlabel(xlabel)
ax.set_ylabel(ylabel)
ax.set_xticks(np.arange(len(xticklabels)) + .5)
ax.set_yticks(np.arange(len(yticklabels)) + .5)
ax.set_xticklabels(xticklabels)
ax.set_yticklabels(yticklabels)
ax.set_aspect(1)
ax.set_title(title)
for p, color, value in zip(img.get_paths(), img.get_facecolors(),
img.get_array()):
x, y = p.vertices[:-2, :].mean(0)
if np.mean(color[:3]) > 0.5:
c = 'k'
else:
c = 'w'
ax.text(x, y, fmt % value, color=c, ha="center", va="center")
return img
from sklearn.datasets import load_diabetes
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_regression
ds = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(ds.data, ds.target,
random_state=0)
from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(
PolynomialFeatures(),
StandardScaler(),
Ridge())
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1, scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)
GridSearchCV(cv=5,
estimator=Pipeline(steps=[('polynomialfeatures',
PolynomialFeatures()),
('standardscaler', StandardScaler()),
('ridge', Ridge())]),
n_jobs=-1,
param_grid={'polynomialfeatures__degree': [1, 2, 3],
'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]},
scoring='neg_mean_absolute_error')
# multiply by -1 to convert utility to loss
# I.e., recover mean_absolute_error from 'neg_mean_absolute_error'
grid.cv_results_['mean_test_score'].reshape(3, -1)*[-1.0]
array([[ 44.91738873, 44.91647662, 44.90770714, 44.84493389,
44.74831585, 46.00356004],
[ 48.46119478, 48.1859605 , 47.82371022, 46.45813267,
44.91168914, 45.65990836],
[265.66721069, 176.00935899, 120.51317976, 77.00289633,
53.78820913, 47.23253223]])
heatmap(-grid.cv_results_['mean_test_score'].reshape(3, -1),
xlabel="ridge__alpha", ylabel="polynomialfeatures__degree",
xticklabels=param_grid['ridge__alpha'],
yticklabels=param_grid['polynomialfeatures__degree'], vmin=0,
title="GridSearch Results\n CV avg (Mean absolute error) NOTE: Smaller is better!")
<matplotlib.collections.PolyCollection at 0x7f8955c6e210>
print("Best parameters: {}".format(grid.best_params_))
Best parameters: {'polynomialfeatures__degree': 1, 'ridge__alpha': 10}
# score returns neg_mean_absolute_error so must muiltiply by -1.0 to
# get the loss metric: mean_absolute_error
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test) *-1.0))
Test-set score: 45.15
Do the polynomial features help with to the test error?
What is the mean absolute error on the test dataset for this pipeline? Report to two decimal places.
Just crosschecking the pipeline with original features only
param_grid = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)
print("Test-score without poly features: {:.2f}".format(-grid.score(X_test, y_test)))
Test-score without poly features: 45.15
from sklearn.datasets import load_diabetes
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_regression, SelectKBest
ds = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(ds.data, ds.target,
random_state=0)
from sklearn.preprocessing import PolynomialFeatures
pipe = make_pipeline(
StandardScaler(),
PolynomialFeatures(),
SelectPercentile(score_func=f_regression),
#SelectKBest(score_func=f_regression),
Ridge())
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
'selectpercentile__percentile' : [10,20,50, 65, 70,80,90]
#'selectkbest__k':[2, 5, 6, 7,8,9,10]
}
#grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)
print("Best parameters: {}".format(grid.best_params_))
print("Test-set score: {:.2f}".format(-grid.score(X_test, y_test)))
#grid.estimator.named_steps.selectkbest.get_support(indices=True) #does not work as expected. Bug?
Best parameters: {'polynomialfeatures__degree': 2, 'ridge__alpha': 0.1, 'selectpercentile__percentile': 65}
Test-set score: 49.13
/usr/local/lib/python3.7/site-packages/sklearn/feature_selection/_univariate_selection.py:302: RuntimeWarning: divide by zero encountered in true_divide corr /= X_norms /usr/local/lib/python3.7/site-packages/sklearn/feature_selection/_univariate_selection.py:307: RuntimeWarning: invalid value encountered in true_divide F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
Use selectpercentile for feature selection, what is the best percentile using this pipeline?
65
65
What is the mean absolute error on the test dataset for this pipeline? Report to two decimal places.
49.13
49.13
Consider the provided dataset "1.csv". The datas contains 5 columns, $x_1, x_2, x_3, x_4$, and $y$. We know that the datas were generated by the following function $y=\sum_{i=1}^4 w_i x_i+b$, where $w_1, \ldots, w_4,b$ are constants. However, we do not know the coefficients, and the datas are noisy. Let us implement linear regression to find out $w_1, \ldots, w_4,b$.
import pandas as pd
import numpy as np
data = pd.read_csv('1.csv')
x = data.values[:,0:4]
y = data.values[:,4]
print(np.shape(x))
print(np.shape(y))
data.describe()
(1000, 4) (1000,)
| x1 | x2 | x3 | x4 | y | |
|---|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 5.059842 | 4.879701 | 4.904800 | 4.984575 | 22.175150 |
| std | 2.884683 | 2.905157 | 2.842813 | 2.842572 | 9.040555 |
| min | 0.005965 | 0.000410 | 0.003563 | 0.002923 | 2.825928 |
| 25% | 2.515286 | 2.332052 | 2.404860 | 2.677740 | 14.833665 |
| 50% | 5.195768 | 4.649766 | 5.025517 | 4.959585 | 22.527191 |
| 75% | 7.543820 | 7.311535 | 7.320756 | 7.429816 | 29.476139 |
| max | 9.999529 | 9.983166 | 9.995840 | 9.996643 | 40.579218 |
Then, we implement data augmentation as shown in video 4.5 and solve for the closed-form solution for linear regression
#Data augmentation
X = np.c_[np.ones(x.shape[0]), x]
print(np.shape(X))
(1000, 5)
#Closed-form solution (Hint: You can use np.matmul for matrix multiplication)
#==================================================#
# Place your code between here #
w = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.transpose(),X)),X.transpose()),y)
# ... and here #
# When asked to copy/paste your code in homework #
# submission quiz, only submit the code you added #
# (or modified) between the comment blocks #
# Hint: you need to set the params used below #
#==================================================#
print("Linear regrssion model (note weight_0 represents the bias term)")
for i, w_v in enumerate(w):
print (f"weight_{i}: {w_v:.3f}")
Linear regrssion model (note weight_0 represents the bias term) weight_0: 1.999 weight_1: 3.000 weight_2: 0.001 weight_3: 0.004 weight_4: 0.998
#Calculate Mean-Squared Error
#==================================================#
# Place your code between here #
yhat =X.dot(w)
m=X.shape[0]
MSE = 1/float(m)*(np.dot((yhat-y),(yhat-y)))
# When asked to copy/paste your code in homework #
# submission quiz, only submit the code you added #
# (or modified) between the comment blocks #
# Hint: you need to set the params used below #
#==================================================#
print (f"MSE: {MSE:.3f}")
MSE: 0.079
0.079
0.079
Let's try the linear regression tool from sklearn, is the solution similar to yours?
from sklearn.linear_model import LinearRegression
reg = LinearRegression(fit_intercept=False).fit(X, y)
#reg.coef_
print("Linear regrssion model (note weight_0 represents the bias term)")
for i, w_v in enumerate(reg.coef_):
print (f"weight_{i}: {w_v:.3f}")
Linear regrssion model (note weight_0 represents the bias term) weight_0: 1.999 weight_1: 3.000 weight_2: 0.001 weight_3: 0.004 weight_4: 0.998
Here we can see that the original function could be $y=3x_1+x_4+2$
Part1 of this questions shows a linear relation between the inputs and the outputs. However this is usually not the case for real-life applications. In this part, we will explore datasets with polynomial features and interaction features.
First of all, we can generate such a dataset using sklearn, this code will demonstrate how to use it. (You can also use this link to check the user guide)
from sklearn.preprocessing import PolynomialFeatures
x = np.arange(4).reshape(2, 2)
poly = PolynomialFeatures(2)
print (x)
poly.fit_transform(x)
[[0 1] [2 3]]
array([[1., 0., 1., 0., 0., 1.],
[1., 2., 3., 4., 6., 9.]])
Here the original dataset is $(x_1,x_2)$ = (0,1), (2,3), and we generated a new dataset $(1,x_1,x_2,x_1^2,x_1x_2,x_2^2)$ = (1,0,1,0,0,1), (1,2,3,4,6,9)
Now consider a dataset "2.csv" similar to part1, generated by the function $y=\sum_{i=1}^3 w_i x_i+\sum_{i=1}^3 v_i x_i^2+\sum_{i,j=1(i\ne j)}^3 u_{ij} x_i x_j + b$.
#load the new dataset
data = pd.read_csv('2.csv')
x = data.values[:,0:3]
y = data.values[:,3]
print(np.shape(x))
print(np.shape(y))
data.describe()
(1000, 3) (1000,)
| x1 | x2 | x3 | y | |
|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 5.062826 | 4.858563 | 4.984561 | 158.216686 |
| std | 2.802937 | 2.813091 | 2.807530 | 101.954698 |
| min | 0.001398 | 0.012943 | 0.047803 | -2.496791 |
| 25% | 2.634888 | 2.567765 | 2.607776 | 77.119631 |
| 50% | 5.105599 | 4.831760 | 4.885898 | 144.897056 |
| 75% | 7.430155 | 7.089565 | 7.326185 | 222.401429 |
| max | 9.993621 | 9.984340 | 9.998866 | 544.752564 |
Is this new function solvable by linear regression? The answer is yes. The only thing we have to do is to generate polynomial and interaction features.
X = poly.fit_transform(x)
print (np.shape(X))
(1000, 10)
reg = LinearRegression(fit_intercept=False).fit(X, y)
reg.coef_
array([-1.00000000e+00, -2.60239935e-14, -1.41782284e-14, -1.00000000e+00,
2.00000000e+00, 1.00000000e+00, -1.62530260e-15, -9.09922151e-16,
3.00000000e+00, -1.24512265e-15])
Now, we know these are the coefficients of the original function, let's see if we can recover them from all the derived features (interactions and polynomials). Run the following code and respond to the questions following the code.
from sklearn.datasets import load_diabetes
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_regression, SelectKBest
from sklearn.model_selection import train_test_split
#ds = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(x,y,random_state=0)
from sklearn.preprocessing import PolynomialFeatures
#==================================================#
# Place your code between here #
pipe = make_pipeline(
StandardScaler(),
PolynomialFeatures(),
SelectPercentile(score_func=f_regression),
Ridge())
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100],
'selectpercentile__percentile' : [10,20,50,65,70, 80 ,90] } #
# When asked to copy/paste your code in homework #
# submission quiz, only submit the code you added #
# (or modified) between the comment blocks #
# Hints:
# 1. The pipeline should contain:
# StandardScaler, SelectPercentile with f_regression
# as the scoring function parameter, PolynomialFeatures,
# and Ridge
# 2. SelectPercentile percentile parameter should have
# following options: 10,20,50,65,70,80,90
#==================================================#
#grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)
print("Best parameters: {}".format(grid.best_params_))
print("Test-set score: {:.5f}".format(-grid.score(X_test, y_test)))
#grid.estimator.named_steps.selectkbest.get_support(indices=True) #does not work as expected. Bug?
Best parameters: {'polynomialfeatures__degree': 2, 'ridge__alpha': 0.001, 'selectpercentile__percentile': 90}
Test-set score: 0.00011
/usr/local/lib/python3.7/site-packages/sklearn/feature_selection/_univariate_selection.py:302: RuntimeWarning: divide by zero encountered in true_divide corr /= X_norms /usr/local/lib/python3.7/site-packages/sklearn/feature_selection/_univariate_selection.py:307: RuntimeWarning: invalid value encountered in true_divide F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
Ok, so we just used grid search to search for the best model/parameter combination for this model. Let's take a look and understand what it means.
First, let's see what is inside this pipeline.
print (grid.best_estimator_)
poly = grid.best_estimator_.named_steps['polynomialfeatures']
select = grid.best_estimator_.named_steps['selectpercentile']
ridge = grid.best_estimator_.named_steps['ridge']
Pipeline(steps=[('standardscaler', StandardScaler()),
('polynomialfeatures', PolynomialFeatures()),
('selectpercentile',
SelectPercentile(percentile=90,
score_func=<function f_regression at 0x7f3212a4d560>)),
('ridge', Ridge(alpha=0.001))])
import numpy as np
print (np.shape(X_train))
X_train_poly = poly.fit_transform(X_train)
print (np.shape(X_train_poly))
(750, 3) (750, 10)
X_train_poly_selected = select.fit_transform(X_train_poly,y_train)
print (np.shape(X_train_poly_selected))
#Here we are selecting 90% top features, so only 8 features remaining.
(750, 9)
/usr/local/lib/python3.7/site-packages/sklearn/feature_selection/_univariate_selection.py:302: RuntimeWarning: divide by zero encountered in true_divide corr /= X_norms /usr/local/lib/python3.7/site-packages/sklearn/feature_selection/_univariate_selection.py:307: RuntimeWarning: invalid value encountered in true_divide F = corr ** 2 / (1 - corr ** 2) * degrees_of_freedom
To figure out which feature was dropped, we print out the order of coefficient, then look at the p_values of feature selection step.
feature_names = features = poly.get_feature_names()
print(f"all features before selection: \n{feature_names}")
#p_values = select.pvalues_
#print ("{p_values:}")
feature_names_selected = [feature_names[k] for k in select.get_support(indices=True)]
print(f"{feature_names_selected}\n{len(feature_names_selected)} features of {len(feature_names)} were selected ")
print("NOTE: the bias feature was dropped temporarily")
print(f"The following features were dropped{set(feature_names) - set(feature_names_selected)}")
all features before selection:
['1', 'x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']
['x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']
9 features of 10 were selected
NOTE: the bias feature was dropped temporarily
The following features were dropped{'1'}
np.round(34.4415556, 2)
34.44
print("The Linear regressions model is as follows:")
print(f"{dict(zip(feature_names_selected, np.r_[ridge.intercept_, ridge.coef_]))}")
print("The polynomial regression model is a follows:")
display(pd.DataFrame(np.expand_dims(np.round(np.r_[ridge.intercept_, ridge.coef_], 3), axis =0), columns=["BIAS"]+feature_names_selected))
The Linear regressions model is as follows:
{'x0': 142.14823544562023, 'x1': 71.16522493552299, 'x2': 55.93075961765666, 'x0^2': 38.576169686766264, 'x0 x1': 16.169247295671603, 'x0 x2': 7.968779495896682, 'x1^2': -4.0210300969313195e-06, 'x1 x2': 1.0596035027770574e-05, 'x2^2': 23.783516179782893}
The polynomial regression model is a follows:
| BIAS | x0 | x1 | x2 | x0^2 | x0 x1 | x0 x2 | x1^2 | x1 x2 | x2^2 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 142.148 | 71.165 | 55.931 | 38.576 | 16.169 | 7.969 | -0.0 | 0.0 | 23.784 | -0.0 |
### Question: what is the coefficient for the term $x_1x_2$?
What is the coefficient for the term $x_1x_2$? Report with 3 decimal point precision. Example: 5.139 is reported for 5.1387462123
23.784
23.784
The objective function of linear regression is the following: $$ f(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n}\left[ (\mathbf{w}\cdot\mathbf{x}_i + b) - y_i\right]^2,\\ n = \left|X_{\text{train}}\right| $$ So we want to minimize the squared value of difference between predictions and real answers. It is called Mean Squared Error (MSE). Gradient Descent is the way of optimizing this complex functional and tune weigths $\mathbf{w}$ and bias $b$.
To be able to treat weigths $\mathbf{w}$ and bias $b$ homogeneously we're going to augment the data with the "shell" feature (all $1$'s). Then we can add one more parameter to the weight vector and treat it as a bias. $$ \mathbf{x}' := \begin{bmatrix} \mathbf{x}\\ 1 \end{bmatrix},\quad \boldsymbol{\theta} := \begin{bmatrix} \mathbf{w}\\ b \end{bmatrix} \\ f(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left[ \boldsymbol{\theta}\cdot\mathbf{x}'_i - y_i\right]^2 $$ In this way it is much more easier to carry out oprimization process.
To simplify it further and do it in "tensor" way let's rewrite it in matrix form. Let's introduce data matrix (the same as dataframe we used everywhere above)
$$ \text{X}' = \begin{bmatrix} \mathbf{x'}_1^{\text{T}}\\ \vdots\\ \mathbf{x'}_n^{\text{T}} \end{bmatrix},\quad \mathbf{y} = \begin{bmatrix} y_1\\ \vdots\\ y_n \end{bmatrix} $$Matrix $\text{X}$ contains objects in its rows and features in its columns. Vector $\mathbf{y}$ is a vector of answers. Then the objective can be rewritten as follows:
$$ f(\boldsymbol{\theta}) = \frac{1}{n}\|\text{X}'\cdot \boldsymbol{\theta} - \mathbf{y}\|_2^2 $$Then the gradient can be easily calculated in vectorized form:
$$ \nabla_{\boldsymbol{\theta}} f(\boldsymbol{\theta}) = \frac{2}{n}\,\text{X}'^{\text{T}}\left(\text{X}'\cdot \boldsymbol{\theta} - \mathbf{y}\right) $$Exactly this computations are implemented down below in BasicLinearRegressionHomegrown class
Split into train and test set (with the same $\text{random_state}$ which means we can compare results)
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Scaling
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Basic version of homegrown Linear Regression
class BasicLinearRegressionHomegrown(object):
def __init__(self):
self.coef_ = None # weight vector
self.intercept_ = None # bias term
self._theta = None # augmented weight vector, i.e., bias + weights
# this allows to treat all decision variables homogeneously
self.history = {"cost": [],
"coef": [],
"intercept": [],
"grad": []}
def _grad(self, X, y):
"""
Calculate the gradient of the objective function
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
Return:
gradient(ndarray): analytical gradient vector
"""
pred = np.dot(X, self._theta)
error = pred - y
gradient = 2 * np.dot(error, X) / X.shape[0]
return gradient
# full gradient descent, i.e., not stochastic gd
def _gd(self, X, y, max_iter, alpha=0.0005):
"""
Runs GD and logs error, weigths, gradient at every step
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
max_iter(int): number of weight updates
alpha(floar): step size in direction of gradient
Return:
None
"""
for i in range(max_iter):
self.history["coef"].append(self._theta[1:].copy())
self.history["intercept"].append(self._theta[0].copy())
rmse = self.score(X, y)
self.history["cost"].append(rmse)
# calculate gradient
grad = self._grad(X, y)
self.history["grad"].append(grad)
# do gradient step
self._theta -= alpha * grad
def fit(self, X, y, max_iter=1000):
"""
Public API for fitting a linear regression model
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
max_iter(int): number of weight updates
Return:
self
"""
# Augment the data with the bias term.
# So we can treat the the input variables and the bias term homogeneously
# from a vectorization perspective
X = np.c_[np.ones(X.shape[0]), X]
# initialize if the first step
if self._theta is None:
self._theta = np.random.rand(X.shape[1])
# do full gradient descent
self._gd(X, y, max_iter)
self.intercept_ = self._theta[0]
self.coef_ = self._theta[1:]
return self
def score(self, X, y):
"""
Calculate RMSE metric
Args:
X(ndarray): objects
y(ndarray): answers
Return:
rmse(float): RMSE
"""
pred = self.predict(X)
error = pred - y
rmse = (np.sum(error ** 2) / X.shape[0]) ** 0.5
return rmse
def predict(self, X):
"""
Make a prediction
Args:
X(ndarray): objects
Return:
pred(ndarray): predictions
"""
# check whether X has appended bias feature or not
if X.shape[1] == len(self._theta):
pred = np.dot(X, self._theta)
else:
pred = np.dot(X, self.coef_) + self.intercept_
return pred
Create model
model_homegrown = BasicLinearRegressionHomegrown()
model_homegrown.fit(X_train, y_train, max_iter=40000)
plt.figure(figsize=(10, 8))
plt.plot(model_homegrown.history["cost"], label="Train")
plt.xlabel("Iteration")
plt.ylabel("RMSE")
plt.title("Linear Regression via Gradient Descent")
plt.legend();
models = [model_sk, model_homegrown]
models_names = ["Sklearn", "Homegrown"]
evaluate(models, metrics, samples, metrics_names, models_names)
Random search algorithm consists of the following steps:
class RandomSearchLinearRegressionHomegrown(BasicLinearRegressionHomegrown):
def __init__(self):
# call the constructor of the parent class
super(RandomSearchLinearRegressionHomegrown, self).__init__()
self.history = {"cost": [],
"coef": [],
"intercept": []}
def _rs(self, X, y, max_iter):
"""
Runs Random Search and logs error and weigths at every step
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
max_iter(int): number of points to sample
Return:
None
"""
# use the following numpy command to generate
# random sample from the weight space
sampled_weights = np.random.random(size=(max_iter, len(self._theta))) * 20 - 10
# go ahead and choose the best of sampled weights
# according to the objective
#==================================================#
# Your code starts here #
#==================================================#
# error for each theta
# in this case error is a matrix with size [max_iter * number of examples]
errors =
# mse for each theta
mses =
# best theta is the one with smallest error
best_idx =
# get it!
self._theta =
# TODO - change above to
# error for each theta
# in this case error is a matrix with size [max_iter * number of examples]
# errors = ...
# mse for each theta
# mses = ...
# best theta is the one with smallest error
# best_idx = ...
# get it!
# self._theta = ...
#==================================================#
# Your code ends here #
#==================================================#
def fit(self, X, y, max_iter=1000):
"""
Public API for fitting a linear regression model
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
max_iter(int): number of points to sample
Return:
self
"""
X = np.c_[np.ones(X.shape[0]), X]
if self._theta is None:
self._theta = np.random.rand(X.shape[1])
self._rs(X, y, max_iter)
self.intercept_ = self._theta[0]
self.coef_ = self._theta[1:]
return self
Create model
model_homegrown_rs = RandomSearchLinearRegressionHomegrown()
Fitting
np.random.seed(42)
model_homegrown_rs.fit(X_train, y_train, max_iter=10000)
Evaluation
models = [model_sk, model_homegrown, model_homegrown_rs]
models_names = ["Sklearn", "Homegrown Full GD", "Homegrown RS"]
evaluate(models, metrics, samples, metrics_names, models_names)
The formula for analytical gradient (from calculus):
$$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_m} \end{bmatrix}, \text{ where } m \text{ is the space dimension}\\ \frac{\partial f}{\partial x_1} = \lim_{\alpha \rightarrow 0} \frac{f(x_1 + \alpha, x_2 \ldots x_m) - f(x_1, x_2 \ldots x_m)}{\alpha} $$For sufficiently small $\alpha$ one can approximate partial derivative by simple throwing out the limit operator
$$ \frac{\partial f}{\partial x_1} \approx \frac{f(x_1 + \alpha, x_2 \ldots x_m) - f(x_1, x_2 \ldots x_m)}{\alpha} = \left( \frac{\partial f}{\partial x_1} \right)_{\text{num}}\\ $$Then the final approximation of the gradient is:
$$ \nabla f(\mathbf{x}) \approx \nabla_{\text{num}\,\,} f(\mathbf{x}) = \begin{bmatrix} \left( \frac{\partial f}{\partial x_1} \right)_{\text{num}}\\ \vdots\\ \left( \frac{\partial f}{\partial x_m} \right)_{\text{num}} \end{bmatrix} $$The common way of measuring the difference between vectors is the following: $$ \text{er} = \frac{\|\nabla f(\mathbf{x}) - \nabla_{\text{num}\,\,}f(\mathbf{x})\|_2^2}{\|\nabla f(\mathbf{x})\|_2^2} = \frac{\sum_{j=1}^{m}\left(\nabla^j f(\mathbf{x}) - \nabla^j_{\text{num}\,\,}f(\mathbf{x})\right)^2}{\sum_{j=1}^{m}\left(\nabla^j f(\mathbf{x})\right)^2} $$
class TweakedLinearRegressionHomegrown(BasicLinearRegressionHomegrown):
def __init__(self):
# call the constructor of the parent class
super(TweakedLinearRegressionHomegrown, self).__init__()
self.history["grad_num"] = []
@staticmethod
def _gradient_approximation(f, x):
"""
Returns the numerical gradient of the function f at the point x
Args:
f(callable): function that takes the point x as an input
and returns the value of the function
x(ndarray): numpy array which contains the coordinates
of the point to evaluate gradient
Return:
grad_num(ndarray): the numerical approximation
of the gradient
"""
grad_num = np.zeros(len(x))
alpha = 0.001
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
# for i in range(len(x)):
# h = np.zeros(len(x))
# h[i] += alpha
# grad_num[i] = ...
#==================================================#
# Your code ends here #
#==================================================#
return grad_num
def _grad_num(self, X, y):
"""
Returns the numerical gradient of the LinearRegression
objective function
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
Return:
grad_num(ndarray): the numerical approximation
of the gradient
"""
grad_num = np.zeros(X.shape[1])
def f(a):
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
# pred = ...
# error = ...
# mse = ...
# return ...
#==================================================#
# Your code ends here #
#==================================================#
grad_num = self._gradient_approximation(f, self._theta)
return grad_num
def _gd(self, X, y, max_iter, alpha=0.001):
"""
Runs GD and logs error, weigths, gradient and
numerical gradient at every step
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
max_iter(int): number of EPOCHS, i.e., full passes over data
batch_size(int): number of samples in one batch
alpha(floar): step size in direction of gradient
Return:
None
"""
for i in range(max_iter):
self.history["coef"].append(self._theta[1:].copy())
self.history["intercept"].append(self._theta[0].copy())
rmse = self.score(X, y)
self.history["cost"].append(rmse)
grad = self._grad(X, y)
self.history["grad"].append(grad)
grad_num = self._grad_num(X, y)
self.history["grad_num"].append(grad_num)
self._theta -= alpha * grad
Create model
model_homegrown_check_grad = TweakedLinearRegressionHomegrown()
Fitting
np.random.seed(42)
model_homegrown_check_grad.fit(X_train, y_train, max_iter=40000)
Plotting error curves
def relative_error(grad, grad_num):
return np.sum((grad - grad_num) ** 2, axis=1) * 1. / np.sum(grad ** 2, axis=1)
def absolute_error(grad, grad_num):
return np.sum((grad - grad_num) ** 2, axis=1) * 1.
grad_num = np.array(model_homegrown_check_grad.history["grad_num"])
grad = np.array(model_homegrown_check_grad.history["grad"])
plt.figure(figsize=(20, 8))
plt.suptitle("Numerical approximation of gradient quality")
plt.subplot(121)
plt.plot(relative_error(grad, grad_num))
plt.xlabel("Iteration")
plt.ylabel("Relative error")
plt.subplot(122)
plt.plot(absolute_error(grad, grad_num))
plt.xlabel("Iteration")
plt.ylabel("Absolute error")
plt.show()
In Full GD we do a descent step only after the calculation of the gradient over the whole set of data. In this case the gradient is precise and gives the best possible direction. But it can require quite a lot of time if we have huge amounts of data.
In practice we can get faster convergence if we calculate the gradient not over the whole set of data but over the small (size of $B$) batch of it.
$$ \nabla f(\boldsymbol{\theta}) \approx \nabla_{\text{batch}\,\,} f(\boldsymbol{\theta}) = \frac{2}{n}\sum_{i=1}^{B}\left(\mathbf{x}'_{a_i}\cdot \boldsymbol{\theta} - y_{a_i}\right)\cdot \mathbf{x}'_{a_i} $$where $a_i$ is an array of indices of objects which are in this batch. Common approach here that you should use is to shuffle samples randomly and then iterate over them with batches.
So with this batch approach we get an approximation of the real gradient in point $\boldsymbol{\theta}$. This approximation is very cheap and fast to compute (usually $B$ is not too big $-$ from 32 to 256). After obtaining this gradient we do a descent step in this approximate direction and proceed to the next stage of batch descent.
class StochasticLinearRegressionHomegrown(TweakedLinearRegressionHomegrown):
def __init__(self):
# call the constructor of the parent class
super(StochasticLinearRegressionHomegrown, self).__init__()
self.history["grad_num"] = []
def _sgd(self, X, y, max_iter, batch_size, alpha=0.0005):
"""
Runs Stochastic GD and logs error, weigths, gradient and
numerical gradient at every step
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
max_iter(int): number of EPOCHS, i.e., full passes over data
batch_size(int): number of samples in one batch
alpha(floar): step size in direction of gradient
Return:
None
"""
for epoch in range(max_iter):
idxs = np.random.permutation(X.shape[0])
X = X[idxs]
y = y[idxs]
#==================================================#
# Your code starts here #
#==================================================#
# TODO - change above to
# for i in range(...):
#==================================================#
# Your code ends here #
#==================================================#
self.history["coef"].append(self._theta[1:].copy())
self.history["intercept"].append(self._theta[0].copy())
rmse = self.score(X, y)
self.history["cost"].append(rmse)
# calculate gradient
grad = self._grad(X[i:i + batch_size], y[i:i + batch_size])
self.history["grad"].append(grad)
# numerical gradient
grad_num = self._grad_num(X[i:i + batch_size], y[i:i + batch_size])
self.history["grad_num"].append(grad_num)
# do gradient step
self._theta -= alpha * grad
def fit(self, X, y, max_iter=1000, batch_size=16):
"""
Public API for fitting a linear regression model
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
max_iter(int): number of EPOCHS, i.e., full passes over data
batch_size(int): number of samples in one batch
Return:
self
"""
X = np.c_[np.ones(X.shape[0]), X]
if self._theta is None:
self._theta = np.random.rand(X.shape[1])
self._sgd(X, y, max_iter, batch_size)
self.intercept_ = self._theta[0]
self.coef_ = self._theta[1:]
return self
Create model
model_homegrown_sgd = StochasticLinearRegressionHomegrown()
Fitting
np.random.seed(42)
model_homegrown_sgd.fit(X_train, y_train, max_iter=1500)
plt.figure(figsize=(10, 8))
plt.plot(model_homegrown_sgd.history["cost"], label="Train")
plt.xlabel("Iteration")
plt.ylabel("RMSE")
plt.title("Linear Regression via Stochastic Gradient Descent")
plt.legend(frameon=True)
plt.show()
Evaluation
models = [model_sk, model_homegrown, model_homegrown_sgd]
models_names = ["Sklearn", "Homegrown Full GD", "Homegrown SGD"]
evaluate(models, metrics, samples, metrics_names, models_names)
Incorporate L1 and L2 regularization for the BasicLinearRegressionHomegrown class developed above. Start with L2 regularization.
class RegularizedLinearRegressionHomegrown(BasicLinearRegressionHomegrown):
def __init__(self, l1_reg=0.0, l2_reg=0.0):
# call the constructor of the parent class
super(RegularizedLinearRegressionHomegrown, self).__init__()
self.l1_reg = l1_reg
self.l2_reg = l2_reg
def _grad(self, X, y):
"""
Calculate the gradient of the objective function
with L1 and L2 regularizations
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
Return:
gradient(ndarray): analytical gradient vector
"""
pred = np.dot(X, self._theta)
error = pred - y
gradient = 2 * np.dot(error, X) / X.shape[0]
# penalties only for weights
gradient[1:] += 2 * self.l2_reg * self._theta[1:] + self.l1_reg * np.sign(self._theta[1:])
return gradient
model_homegrown_regularized_l2 = RegularizedLinearRegressionHomegrown(l1_reg=0.0, l2_reg=0.1)
model_homegrown_regularized_l1 = RegularizedLinearRegressionHomegrown(l1_reg=1.0, l2_reg=0.0)
np.random.seed(42)
model_homegrown_regularized_l2.fit(X_train, y_train, max_iter=40000)
model_homegrown_regularized_l1.fit(X_train, y_train, max_iter=40000)
models = [model_sk, model_homegrown, model_homegrown_regularized_l2, model_homegrown_regularized_l1]
models_names = ["Sklearn", "Homegrown", "Homegrown Regularized L2", "Homegrown Regularized L1"]
plt.figure(figsize=(12, 8))
plt.bar(np.arange(model_homegrown.coef_.shape[0]) - 0.2, model_homegrown.coef_, width=0.2, label="No reg")
plt.bar(np.arange(model_homegrown_regularized_l2.coef_.shape[0]), model_homegrown_regularized_l2.coef_, width=0.2, label="L2")
plt.bar(np.arange(model_homegrown_regularized_l1.coef_.shape[0]) + 0.2, model_homegrown_regularized_l1.coef_, width=0.2, label="L1")
plt.xticks(np.arange(model_sk.coef_.shape[0]), X.columns, rotation='vertical')
plt.xlim([-1, model_sk.coef_.shape[0]])
plt.title("Model coefficients comparison")
plt.legend();
print("2-norm of weights:\n")
print("{:10s}{:.2f}".format("No reg:", np.linalg.norm(model_homegrown.coef_)))
print("{:10s}{:.2f}".format("L2:", np.linalg.norm(model_homegrown_regularized_l2.coef_)))
print("{:10s}{:.2f}".format("L1:", np.linalg.norm(model_homegrown_regularized_l1.coef_)))
print("Number of non-zero coefficients:\n")
print("{:10s}{:d}".format("No reg:", np.sum(np.abs(model_homegrown.coef_) > 1e-2)))
print("{:10s}{:d}".format("L2:", np.sum(np.abs(model_homegrown_regularized_l2.coef_) > 1e-2)))
print("{:10s}{:d}".format("L1:", np.sum(np.abs(model_homegrown_regularized_l1.coef_) > 1e-2)))
Instead of doing a gradient step with the fixed step size ($\alpha=0.0005$) consider it as a variable after choosing the step directon (gradient one) and try to optimize it. In other words solve analyticaly the following 1D optimization problem:
$$ f\left(\boldsymbol{\theta}_{t} - \alpha \cdot \nabla f(\boldsymbol{\theta}_{t})\right) \rightarrow \min_{\alpha} $$class OptimalStepBasicLinearRegressionHomegrown(BasicLinearRegressionHomegrown):
def __init__(self):
super(OptimalStepBasicLinearRegressionHomegrown, self).__init__()
self.history["alpha"] = []
def _gd(self, X, y, max_iter):
"""
Runs GD and logs error, weigths, gradient at every step.
Here the optimal step size used with formulas from above.
Args:
X(ndarray): train objects
y(ndarray): answers for train objects
max_iter(int): number of weight updates
Return:
None
"""
for i in range(max_iter):
self.history["coef"].append(self._theta[1:].copy())
self.history["intercept"].append(self._theta[0].copy())
rmse = self.score(X, y)
self.history["cost"].append(rmse)
# calculate gradient
grad = self._grad(X, y)
self.history["grad"].append(grad)
# optimum step size
#==================================================#
# Your code starts here #
#==================================================#
alpha = X.shape[0] / 2. * (np.linalg.norm(grad) / np.linalg.norm(X.dot(grad))) ** 2
# TODO:
# alpha =...
#==================================================#
# Your code ends here #
#==================================================#
self.history["alpha"] = alpha
# do gradient step
self._theta -= alpha * grad
model_homegrown_optimal_step = OptimalStepBasicLinearRegressionHomegrown()
np.random.seed(42)
model_homegrown_optimal_step.fit(X_train, y_train, max_iter=40000)
models = [model_sk, model_homegrown, model_homegrown_optimal_step]
models_names = ["Sklearn", "Homegrown", "Homegrown with Optimal step size"]
evaluate(models, metrics, samples, metrics_names, models_names)
plt.figure(figsize=(10, 8))
plt.plot(model_homegrown_optimal_step.history["cost"], label="Optimal step")
plt.plot(model_homegrown.history["cost"], label="Constant step")
plt.xlabel("Iteration")
plt.ylabel("RMSE")
plt.title("Linear Regression via Gradient Descent\nwith optimal step size")
plt.legend(frameon=True);